Abstract

We consider the problem of online stratified sampling for Monte Carlo integration of a
function given a finite budget of n noisy evaluations of the function. More precisely, we
focus on the problem of choosing the number of strata K as a function of the budget n.
We provide asymptotic and finite-time results on how an oracle that has access to the
function would choose the partition optimally. In addition, we prove a lower bound on the
learning rate for the problem of stratified Monte-Carlo. As a result, we are able to state, by
improving the bound on its performance, that algorithm MC-UCB, defined in (Carpentier
and Munos, 2011a), is minimax optimal both in terms of the number of samples n and the
number of strata K, up to a $\sqrt{\log(nK)}$ factor. This enables us to deduce a minimax optimal bound
on the difference between the performance of the estimate output by MC-UCB and the
performance of the estimate output by the best oracle static strategy, on the class of
Hölder continuous functions, up to a $\sqrt{\log(n)}$ factor.

Keywords: Online learning, stratified sampling, Monte Carlo integration, regret bounds. 
1. Introduction 

The objective of this paper is to provide an efficient strategy for Monte-Carlo integration
of a function $f$ over a domain $[0,1]^d$. We assume that we can query the function n times.
Querying the function at a time t and at a point $x_t \in [0,1]^d$ provides a noisy sample

$$f(x_t) + s(x_t)\epsilon_t, \tag{1}$$

where $\epsilon_t$ is an independent sample drawn from $\nu_{x_t}$. Here $\nu_x$ is a distribution with mean 0
and variance 1, whose shape may depend on $x$. This model is actually very general (see
Section 2).

Stratified sampling is a well-known strategy to reduce the variance of the estimate of
the integral of $f$, when compared to the variance of the estimate provided by crude
Monte-Carlo. The principle is to partition the domain into K subsets called strata and then to
sample in each stratum (see (Rubinstein and Kroese, 2008) [Subsection 5.5] or (Glasserman, 
2004)). If the variances of the strata are known, there exists an optimal static allocation 
strategy which allocates the samples proportionally to the measure of the stratum times 
their standard deviation (see Equation 3 in this paper for a reminder). We refer to this 

1. It is the usual model for functions in heteroscedastic noise. We isolate the standard deviation at a point
$x$, $s(x)$, in the expression of the noise, since this quantity is very relevant.
allocation as the optimal oracle strategy for a given partition. When the variations
of $f$ and the standard deviation of the noise $s$ are unknown, it is not possible to adopt this
strategy.

Consider first that the partition of the space is fixed. A way around this problem is
to estimate the variations of the function and the amount of noise on the function in the
strata online (exploration) while allocating the samples according to the estimated optimal
oracle proportions (exploitation). This setting is considered in (Etore and Jourdain, 2010;
Grover, 2009; Carpentier and Munos, 2011a). In the long version (Carpentier and Munos,
2011b) of the last paper, the authors propose the so-called MC-UCB algorithm, which is
based on Upper-Confidence-Bounds (UCB) on the standard deviation. They provide upper
bounds for the difference between the mean-squared error (see footnote 2) of the estimate provided
by MC-UCB and the mean-squared error of the estimate provided by the optimal oracle
strategy (optimal oracle variance). The algorithm performs almost as well as the optimal
oracle strategy. However, the authors of (Carpentier and Munos, 2011b) neither refute nor
establish the optimality of their algorithm with a lower bound as benchmark. As a matter
of fact, no lower bound on the rate of convergence (to the oracle optimal strategy) for the
problem of stratified Monte-Carlo exists, to the best of our knowledge. Still in the same
paper (Carpentier and Munos, 2011b), the authors do not discuss how to stratify
the space. In particular, they do not pose the problem of what an optimal oracle partition
of the space is, and do not try to answer whether it is possible to attain it.

The next step is thus to efficiently design the partition. There are some interesting
papers on that topic, such as (Glasserman et al., 1999; Kawai, 2010; Etore et al., 2011).
The recent, state-of-the-art work of Etore et al. (2011) describes a strategy that samples
asymptotically almost as efficiently as the optimal oracle strategy, and at the same time
adapts the direction and number of the strata online. This is a very difficult problem.
The authors do not provide proofs of convergence of their algorithm. However, for static
allocation of the samples, they present some properties of the stratified estimate when the
number of strata goes to infinity and provide convergence results under the optimal oracle
strategy. As a corollary, they prove that the more strata there are, the smaller the optimal
oracle variance.

Contributions: The more strata there are, the smaller the variance of the estimate com- 
puted when following the optimal oracle strategy. However, the more strata there are, the 
more difficult it is to estimate the variance within each of these strata, and thus the more 
difficult it is to perform almost as well as the optimal oracle strategy. Choosing the number 
of strata is thus crucial and this is the problem we address in this paper. This defines a 
trade-off similar to the one in model selection (and in all its variants, e.g. density estimation, 
regression...): the wider the class of models considered, i.e. the larger the number of strata,
the smaller the distance between the true model and the best model of the class, i.e. the
approximation error, but the larger the estimation error.

The paper (Etore et al., 2011), although proposing no finite-time bounds, develops very
interesting ideas for bounding the first term, i.e. the approximation error. As pointed out in
e.g. (Carpentier and Munos, 2011a), it is possible to build algorithms that have a
small estimation error. By constructing tight and finite-time bounds for the approximation

2. The mean squared error is measured with respect to the quantity of interest, i.e. the integral of $f$.
error, it is thus possible to propose a number of strata that minimizes an upper bound on
the performance. It is however not clear how consistent this choice is, i.e. how much it can
be improved. The essential ingredients for efficiently choosing a partition are thus lower
bounds on the estimation error and on the approximation error.

The objective of this paper is to propose a method for choosing the minimax-optimal 
number of strata. Our contributions are the following. 

• We first present results on what we call the quality $Q_{n,\mathcal{N}}$ of a given partition $\mathcal{N}$ in K
strata (i.e., using the previous analogy to model selection, this would represent the
approximation error). Using very mild assumptions, we compute a lower bound on the
variance of the estimate given by the optimal oracle strategy on the optimal oracle
partition. Then, if the function and the standard deviation of the noise are $\alpha$-Hölder,
and if the strata satisfy some assumptions, we prove that $Q_{n,\mathcal{N}} = O\left(\frac{K^{-\alpha/d}}{n}\right)$. This
bound is also minimax optimal on the class of $\alpha$-Hölder functions.

• We then present results on the estimation error for the estimate output by algorithm
MC-UCB of (Carpentier and Munos, 2011a) (pseudo-regret in the terminology
of (Carpentier and Munos, 2011a)). In this paper, we improve the analysis of the MC-UCB
algorithm when compared to (Carpentier and Munos, 2011a) in terms of
the dependence on K. The problem-independent bound on the pseudo-regret in (Carpentier
and Munos, 2011a) is of order $\tilde{O}(K n^{-4/3})$ (see footnote 3), and we tighten this bound in this
paper so that it is of order $\tilde{O}(K^{1/3} n^{-4/3})$.

• We provide the first lower bound (on the pseudo-regret) for the problem of online
stratified sampling. The bound $\Omega(K^{1/3} n^{-4/3})$ is tight and matches the upper bound
of MC-UCB both in terms of the number of strata and the number of samples. This is
the main contribution of the paper, and we believe that the proof technique for this
bound is original.

• Finally, we combine the results on the quality and on the pseudo-regret of MC-UCB
to provide a value of the number of strata leading to a minimax-optimal trade-off (up
to a $\sqrt{\log(n)}$ factor) on the class of $\alpha$-Hölder functions.

The rest of the paper is organized as follows. In Section 2 we formalize the problem 
and introduce the notations used throughout the paper. Section 3 states the results on 
the quality of a partition. Section 4 improves the analysis of the MC-UCB algorithm, and 
establishes the lower bound on the pseudo-regret. Section 5 reports the best trade-off to
choose the number of strata. And in Section 6, we illustrate how important it is to choose 
carefully the number of strata. We finally conclude the paper and suggest future works. 

2. Setting 

We consider the problem of numerical integration of a function $f : [0,1]^d \to \mathbb{R}$ with respect
to the uniform (Lebesgue) measure. We have a budget of n queries (samples) to the
function, and we can allocate this budget sequentially. When querying the function at a
time t and at a point $x_t$, we receive a noisy sample $X(t)$ of the form described in Equation 1.

3. Here $\tilde{O}$ is a $O$ up to a polynomial in $\log(n)$.
We now assume that the space is stratified in K Lebesgue measurable strata that form
a partition $\mathcal{N}$. We index these strata, called $\Omega_k$, with indexes $k \in \{1, \ldots, K\}$, and write
$w_k$ their measure, according to the Lebesgue measure. We write
$\mu_k = \frac{1}{w_k}\int_{\Omega_k} \mathbb{E}_{\epsilon \sim \nu_x}[f(x) + s(x)\epsilon]\,dx = \frac{1}{w_k}\int_{\Omega_k} f(x)\,dx$
their mean and
$\sigma_k^2 = \frac{1}{w_k}\int_{\Omega_k} \mathbb{E}_{\epsilon \sim \nu_x}[(f(x) + s(x)\epsilon - \mu_k)^2]\,dx$
their variance. These mean and variance correspond to the mean and variance of the random
variable $X(t)$ when the coordinate $x$ at which the noisy evaluation of $f$ is observed is chosen
uniformly at random on the stratum $\Omega_k$.

We denote by $\mathcal{A}$ an algorithm that allocates online the budget by selecting at each time
step $1 \le t \le n$ the index $k_t \in \{1, \ldots, K\}$ of a stratum and then sampling uniformly the
corresponding stratum $\Omega_{k_t}$. The objective is to return the best possible estimate $\hat\mu_n$ of
the integral of the function $f$. We write $T_{k,n} = \sum_{t \le n} \mathbb{I}\{k_t = k\}$ the number of samples in
stratum $\Omega_k$ up to time n. We denote by $(X_{k,t})_{1 \le k \le K,\, 1 \le t \le T_{k,n}}$ the samples in stratum $\Omega_k$,
and we define $\hat\mu_{k,n} = \frac{1}{T_{k,n}}\sum_{t=1}^{T_{k,n}} X_{k,t}$ the empirical means. We estimate the integral of $f$ by
$\hat\mu_n = \sum_{k=1}^{K} w_k \hat\mu_{k,n}$.

If we allocate a deterministic number of samples $T_k$ to each stratum $\Omega_k$ and if the
samples are independent and chosen uniformly on each stratum $\Omega_k$, we have

$$\mathbb{E}(\hat\mu_n) = \sum_{k \le K} w_k \mu_k = \sum_{k \le K} \int_{\Omega_k} f(u)\,du = \int_{[0,1]^d} f(u)\,du = \mu,$$

and also

$$\mathbb{V}(\hat\mu_n) = \sum_{k \le K} \frac{w_k^2 \sigma_k^2}{T_k},$$

where the expectation and the variance are computed according to all the samples that the
algorithm collected.

For a given algorithm $\mathcal{A}$ allocating $T_{k,n}$ samples drawn uniformly within stratum $\Omega_k$,
we denote by pseudo-risk the quantity

$$L_{n,\mathcal{N}}(\mathcal{A}) = \sum_{k \le K} \frac{w_k^2 \sigma_k^2}{T_{k,n}}. \tag{2}$$

Note that if an algorithm $\mathcal{A}^*$ has access to the variances $\sigma_k^2$ of the strata, it can choose
to allocate the budget in order to minimize the pseudo-risk, i.e., sample each stratum
$T_k^* = \frac{w_k\sigma_k}{\sum_{i \le K} w_i\sigma_i}\,n$ times (this is the so-called oracle allocation). These optimal numbers of
samples can be non-integer values, in which case the proposed optimal allocation is not
realizable. But we still use it as a benchmark. The pseudo-risk for this algorithm (which is
also the variance of the estimate here since the sampling strategy is deterministic) is then

$$L_{n,\mathcal{N}}(\mathcal{A}^*) = \frac{\left(\sum_{k \le K} w_k\sigma_k\right)^2}{n} = \frac{\Sigma_{\mathcal{N}}^2}{n}, \tag{3}$$

where $\Sigma_{\mathcal{N}} = \sum_{k \le K} w_k\sigma_k$. In the sequel, we also refer to $\lambda_k = \frac{w_k\sigma_k}{\sum_{i \le K} w_i\sigma_i}$
as the optimal proportion, and to this allocation strategy as the optimal oracle strategy. Although, as already
mentioned, the optimal allocations (and thus the optimal pseudo-risk) might not be
realizable, they are still very useful in providing a lower bound. No static (even oracle) algorithm
has a pseudo-risk lower than $L_{n,\mathcal{N}}(\mathcal{A}^*)$ on partition $\mathcal{N}$.
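As a minimal numerical sketch of Equations 2 and 3, the oracle allocation and its pseudo-risk can be computed as follows. The function names and the three-stratum example are illustrative assumptions, not part of the paper; the strata measures and standard deviations are taken as known, which is exactly what the oracle assumes.

```python
def oracle_allocation(weights, sigmas, n):
    """Oracle sample counts T_k = n * w_k * sigma_k / sum_i(w_i * sigma_i).

    The counts may be non-integer, as noted in the text.
    """
    total = sum(w * s for w, s in zip(weights, sigmas))  # Sigma_N
    return [n * w * s / total for w, s in zip(weights, sigmas)]


def pseudo_risk(weights, sigmas, counts):
    """Pseudo-risk of Equation 2: sum_k w_k^2 * sigma_k^2 / T_k."""
    return sum((w * s) ** 2 / t for w, s, t in zip(weights, sigmas, counts))


# Hypothetical three-stratum example (values chosen for illustration only).
weights = [0.25, 0.25, 0.5]
sigmas = [1.0, 3.0, 0.5]
n = 100

t_star = oracle_allocation(weights, sigmas, n)
risk_star = pseudo_risk(weights, sigmas, t_star)
sigma_n = sum(w * s for w, s in zip(weights, sigmas))
print(t_star, risk_star, sigma_n ** 2 / n)
```

On this example the pseudo-risk of the oracle allocation coincides with $\Sigma_{\mathcal{N}}^2/n$, as in Equation 3, and a uniform allocation of the same budget gives a larger pseudo-risk.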

It is straightforward to see that the more refined the partition $\mathcal{N}$, the smaller $L_{n,\mathcal{N}}(\mathcal{A}^*)$.
We thus define the quality $Q_{n,\mathcal{N}}$ of a partition as the difference between the variance
$L_{n,\mathcal{N}}(\mathcal{A}^*)$ of the estimate provided by the optimal oracle strategy on partition $\mathcal{N}$, and
the infimum of the variance of the optimal oracle strategy on any partition (optimal oracle
partition), with an arbitrary number of strata:

$$Q_{n,\mathcal{N}} = L_{n,\mathcal{N}}(\mathcal{A}^*) - \inf_{\mathcal{N}' \text{ measurable}} L_{n,\mathcal{N}'}(\mathcal{A}^*). \tag{4}$$

We also define the pseudo-regret of an algorithm $\mathcal{A}$ on a given partition $\mathcal{N}$ as the difference
between its pseudo-risk and the variance of the optimal oracle strategy:

$$R_{n,\mathcal{N}}(\mathcal{A}) = L_{n,\mathcal{N}}(\mathcal{A}) - L_{n,\mathcal{N}}(\mathcal{A}^*). \tag{5}$$

We will assess the performance of an algorithm $\mathcal{A}$ by comparing its pseudo-risk to the
minimum possible variance of an optimal oracle strategy on the optimal oracle partition:

$$L_{n,\mathcal{N}}(\mathcal{A}) - \inf_{\mathcal{N}' \text{ measurable}} L_{n,\mathcal{N}'}(\mathcal{A}^*) = R_{n,\mathcal{N}}(\mathcal{A}) + Q_{n,\mathcal{N}}. \tag{6}$$

Using the analogy of model selection mentioned in the Introduction, the quality $Q_{n,\mathcal{N}}$
is similar to the approximation error and the pseudo-regret $R_{n,\mathcal{N}}(\mathcal{A})$ to the estimation error.
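The decomposition of Equation 6 can be checked numerically with a small sketch. It relies on the result of Section 3 that the infimum over partitions of the oracle variance is $\frac{1}{n}(\int_{[0,1]^d} s(x)dx)^2$; the helper names, the two-stratum example and the value `int_s` standing for $\int s$ are hypothetical.

```python
def pseudo_risk(weights, sigmas, counts):
    """Equation 2: sum_k w_k^2 * sigma_k^2 / T_k."""
    return sum((w * s) ** 2 / t for w, s, t in zip(weights, sigmas, counts))


def oracle_risk(weights, sigmas, n):
    """Equation 3: Sigma_N^2 / n."""
    return sum(w * s for w, s in zip(weights, sigmas)) ** 2 / n


def pseudo_regret(weights, sigmas, counts):
    """Equation 5: pseudo-risk minus the oracle variance on the same partition."""
    return pseudo_risk(weights, sigmas, counts) - oracle_risk(weights, sigmas, sum(counts))


def quality(weights, sigmas, n, int_s):
    """Equation 4, with the infimum over partitions replaced by int_s^2 / n."""
    return oracle_risk(weights, sigmas, n) - int_s ** 2 / n


# Hypothetical example: two strata, uniform (non-oracle) allocation of n = 60 samples.
weights, sigmas, counts, int_s = [0.5, 0.5], [1.0, 2.0], [30, 30], 1.2
n = sum(counts)
regret = pseudo_regret(weights, sigmas, counts)
qual = quality(weights, sigmas, n, int_s)
excess = pseudo_risk(weights, sigmas, counts) - int_s ** 2 / n
print(regret, qual, excess)
```

The excess pseudo-risk over the best achievable oracle variance splits into an estimation part (`regret`) and an approximation part (`qual`), mirroring the model-selection analogy.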

Motivation for the model $f(x) + s(x)\epsilon_t$: Assume that a learner can, at each time t,
choose a point $x$ and collect an observation $F(x, W_t)$, where $W_t$ is an independent noise
that can however depend on $x$. It is the general model for representing evaluations of a
noisy function. There are many settings where one needs to integrate accurately a noisy
function without wasting too much budget, for instance pollution surveys. Set $f(x) = \mathbb{E}_{W_t}[F(x, W_t)]$
and $s(x)\epsilon_t = F(x, W_t) - f(x)$. Since by definition $\epsilon_t$ has mean 0 and variance
1, we have in fact $s(x) = \sqrt{\mathbb{E}_{\nu_x}[(F(x, W_t) - f(x))^2]}$ and $\epsilon_t = \frac{F(x, W_t) - f(x)}{s(x)}$. Observing
$F(x, W_t)$ is equivalent to observing $f(x) + s(x)\epsilon_t$, and this implies that the model that we
choose is also very general.

There is also an important setting where this model is relevant: the integration
of a function $F$ in high dimension $d^*$. Stratifying in dimension $d^*$ seems hopeless, since
the budget n has to be exponential in $d^*$ if one wants to stratify in every direction of
the domain: this is the curse of dimensionality. It is necessary to reduce the dimension
by choosing a small number of directions $(1, \ldots, d)$ that are particularly relevant, and
control/stratify only in these $d$ directions (see footnote 4). Then the control/stratification is only on the
first $d$ coordinates, so when sampling at a time t, one chooses $x = (x_1, \ldots, x_d)$, and
the other $d^* - d$ coordinates $U(t) = (U_{d+1}(t), \ldots, U_{d^*}(t))$ are uniform random variables on
$[0,1]^{d^*-d}$ (without any control). When sampling in $x$ at a time t, we observe $F(x, U(t))$.
By writing $f(x) = \mathbb{E}_{U(t) \sim \mathcal{U}([0,1]^{d^*-d})}[F(x, U(t))]$ and $s(x)\epsilon_t = F(x, U(t)) - f(x)$, we obtain
that the model we propose is also valid in this case.

4. This is actually a very common technique for computing the price of options, see (Glasserman, 2004). 
3. The quality of a partition: Analysis of the term $Q_{n,\mathcal{N}}$.

In this Section, we focus on the quality of a partition defined in Section 2. 

Convergence under very mild assumptions: As mentioned in Section 2, the more
refined the partition $\mathcal{N}$ of the space, the smaller $L_{n,\mathcal{N}}(\mathcal{A}^*)$, and thus $\Sigma_{\mathcal{N}}$. Through this
monotonicity property, we know that $\inf_{\mathcal{N}}\Sigma_{\mathcal{N}}$ is also the limit of the $(\Sigma_{\mathcal{N}_p})_p$ of a sequence of
partitions $(\mathcal{N}_p)_p$ such that the diameter of each stratum goes to 0. We state in the following
Proposition that for any such sequence, $\lim_{p \to +\infty}\Sigma_{\mathcal{N}_p} = \int_{[0,1]^d} s(x)\,dx$. Consequently
$\inf_{\mathcal{N}}\Sigma_{\mathcal{N}} = \int_{[0,1]^d} s(x)\,dx$.

Proposition 1 Let $(\mathcal{N}_p)_p = (\Omega_{k,p})_{k \in \{1,\ldots,K_p\},\, p \in \{1,\ldots,+\infty\}}$ be a sequence of measurable
partitions (where $K_p$ is the number of strata of partition $\mathcal{N}_p$) such that

• AS1: $0 < w_{k,p} \le \upsilon_p$, for some sequence $(\upsilon_p)_p$, where $\upsilon_p \to 0$ for $p \to +\infty$.

• AS2: The diameters according to the $\|\cdot\|_2$ norm on $\mathbb{R}^d$ of the strata are such that
$\max_k \mathrm{Diam}(\Omega_{k,p}) \le D(w_{k,p})$, for some real valued function $D(\cdot)$, such that $D(w) \to 0$
for $w \to 0$.

If the functions $f$ and $s$ are in $L_2([0,1]^d)$, then

$$\lim_{p \to +\infty}\Sigma_{\mathcal{N}_p} = \inf_{\mathcal{N} \text{ measurable}}\Sigma_{\mathcal{N}} = \int_{[0,1]^d} s(x)\,dx,$$

which implies that $n \times Q_{n,\mathcal{N}_p} \to 0$ for $p \to +\infty$.

Proof [Sketch of Proof. The full proof is in the Supplementary Material (Appendix B)]
The form of the model and the definition of $\sigma_k$ imply that

$$\sigma_k^2 = \frac{1}{w_k}\int_{\Omega_k}\Big(f(x) - \frac{1}{w_k}\int_{\Omega_k} f(u)\,du\Big)^2 dx + \frac{1}{w_k}\int_{\Omega_k} s(x)^2\,dx. \tag{7}$$

We first prove that the result holds for uniformly continuous functions, and then generalize
to $L_2$ functions based on a density argument.

Step 1: Convergence when $f$ and $s$ are uniformly continuous: Assume that $f$ and
$s$ are uniformly continuous with respect to the $\|\cdot\|_2$ norm. For any $\upsilon > 0$, there exists
$\eta$ s.t. $\forall x$, $|s(x+u) - s(x)| \le \upsilon$ and $|f(x+u) - f(x)| \le \upsilon$ where $u \in B_{2,d}(\eta)$. We choose K
large enough so that the size of the strata is smaller than $\upsilon$, and their diameter is smaller
than $\eta$ (it is possible to do so since the diameter of the strata shrinks to 0 as $K \to \infty$).
From Equation 7 we deduce that

$$\sigma_k^2 - \Big(\frac{1}{w_k}\int_{\Omega_k} s(x)\,dx\Big)^2 \le 2\upsilon^2,$$

and using the concavity of the square-root function, we have $\sum_k w_k\sigma_k - \int_{[0,1]^d} s(x)\,dx \le \sqrt{2}\,\upsilon$,
which concludes the proof for uniformly continuous functions.

Step 2: Generalization to the case where $f$ and $s$ are in $L_2([0,1]^d)$: From the density
property of the uniformly continuous functions in $L_2([0,1]^d)$ (with respect to the $\|\cdot\|_2$ norm),






Minimax Number of Strata for Online Stratified Sampling given Noisy Samples 



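we deduce that for any K and $\upsilon$, there exist two uniformly continuous functions $f_\upsilon$ and $s_\upsilon$
such that

$$\sum_{k=1}^{K} w_k \left| \sqrt{\frac{1}{w_k}\int_{\Omega_k}\Big(f_\upsilon(x) - \frac{1}{w_k}\int_{\Omega_k} f_\upsilon(u)\,du\Big)^2 dx + \frac{1}{w_k}\int_{\Omega_k} s_\upsilon^2(x)\,dx} - \sigma_k \right| \le \upsilon,$$

and also that $\int_{[0,1]^d} |s(x) - s_\upsilon(x)|\,dx \le \upsilon$. One concludes by combining those two inequalities
with Step 1. ■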



In Proposition 1, even though the optimal oracle allocation might not be realizable (in
particular if the number of strata is larger than the budget), we can still compute the quality
of a partition, as defined in Equation 4. It does not correspond to any reachable pseudo-risk, but
rather to a lower bound on any (even oracle) static allocation.

When $f$ and $s$ are in $L_2([0,1]^d)$, for any appropriate sequence of partitions $(\mathcal{N}_p)_p$, $\Sigma_{\mathcal{N}_p}$
(which is the principal ingredient of the variance of the optimal oracle allocation) converges
to the smallest possible $\Sigma_{\mathcal{N}}$ for given $f$ and $s$. Note however that this condition is not
sufficient to obtain a rate.

Finite-time analysis under Hölder assumption: We make the following assumption
on the functions $f$ and $s$.

Assumption 1 The functions $f$ and $s$ are $(M, \alpha)$-Hölder continuous, i.e., for $g \in \{f, s\}$,
for any $x$ and $y \in [0,1]^d$, $|g(x) - g(y)| \le M\|x - y\|_2^\alpha$.

The Hölder assumption allows arbitrarily non-smooth functions (for small
$\alpha$, the function can vary arbitrarily fast), and is thus a fairly general assumption.
We also consider the following partitions in K hyper-cubic strata.

Definition 2 We write $\mathcal{N}_K$ the partition of $[0,1]^d$ in K hyper-cubic strata of measure
$w_k = w = \frac{1}{K}$ and side length $\left(\frac{1}{K}\right)^{1/d}$: we assume for simplicity that there exists an integer
$l$ such that $K = l^d$.

The following Proposition holds. 

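Proposition 3 Under Assumption 1 we have for any partition $\mathcal{N}_K$ as defined in Definition 2 that

$$\Sigma_{\mathcal{N}_K} - \int_{[0,1]^d} s(x)\,dx \le \sqrt{2d}\,M\left(\frac{1}{K}\right)^{\alpha/d}, \tag{8}$$

which implies

$$Q_{n,\mathcal{N}_K} \le \frac{2\sqrt{2d}\,M\,\Sigma_{\mathcal{N}_1}}{n}\left(\frac{1}{K}\right)^{\alpha/d},$$

where $\mathcal{N}_1$ stands for the "partition" with one stratum.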

Proof [Sketch of Proof for Proposition 3] We deduce from Assumption 1 that

$$\frac{1}{w_k}\int_{\Omega_k}\Big(f(x) - \frac{1}{w_k}\int_{\Omega_k} f(u)\,du\Big)^2 dx + \frac{1}{w_k}\int_{\Omega_k} s^2(x)\,dx - \Big(\frac{1}{w_k}\int_{\Omega_k} s(u)\,du\Big)^2 \le 2M^2 d\left(\frac{1}{K}\right)^{2\alpha/d}.$$

Then, by using Equation 7 and by summing over all strata, we deduce Equation 8. Now
the result on the quality follows from the fact that
$\Sigma_{\mathcal{N}_K}^2 - \big(\int_{[0,1]^d} s(x)\,dx\big)^2 = \big(\Sigma_{\mathcal{N}_K} - \int_{[0,1]^d} s(x)\,dx\big)\big(\Sigma_{\mathcal{N}_K} + \int_{[0,1]^d} s(x)\,dx\big) \le 2\Sigma_{\mathcal{N}_1}\big(\Sigma_{\mathcal{N}_K} - \int_{[0,1]^d} s(x)\,dx\big)$. ■
The full proof for this Proposition is in the Supplementary Material (Appendix C).

3.1. General comments 

The impact of $\alpha$ and $d$: The quantity $Q_{n,\mathcal{N}_K}$ increases with the dimension $d$, because
the Hölder assumption becomes less constraining when $d$ increases. This can easily be seen
since a hyper-cubic stratum of measure $w$ has a diameter of order $w^{1/d}$. $Q_{n,\mathcal{N}_K}$ decreases with the
smoothness $\alpha$ of the function, which is a logical effect of the Hölder assumption. Note also
that when defining the partitions $\mathcal{N}_K$ in Definition 2, we made the crucial assumption that
$K^{1/d}$ is an integer. This fact is of little importance in small dimension, but will matter in
high dimension, as we will highlight in the last remark of Section 5.

Minimax optimality of this rate: The rate $n^{-1}K^{-\alpha/d}$ is minimax optimal on the class
of $\alpha$-Hölder functions since for any n and K one can easily build a function with Hölder
exponent $\alpha$ such that the corresponding $\Sigma_{\mathcal{N}_K}$ is at least $\int_{[0,1]^d} s(x)\,dx + cK^{-\alpha/d}$ for some
constant $c$.

Discussion on the shape of the strata: Whatever the shape of the strata, as long as
their diameter goes to 0 (see footnote 5), $\Sigma_{\mathcal{N}_K}$ converges to $\int_{[0,1]^d} s(x)\,dx$. The shape of the strata has an
influence only on the negligible term, i.e. the speed of convergence to this quantity. This
result was already made explicit, in a different setting and under different assumptions, in 
(Etore et al., 2011). Choosing small strata of same shape and size is also minimax optimal 
on the class of Holder functions. Working on the shape of the strata could, however, improve 
the speed of convergence in some specific cases, e.g. when the noise is very localized. It 
could also be interesting to consider strata of varying size, and make this size depend on 
the specific problem. 

The decomposition of the variance: Note that the variance $\sigma_k^2$ within each stratum
$\Omega_k$ comes from two sources. First, $\sigma_k^2$ comes from the noise, which contributes to
it by $\frac{1}{w_k}\int_{\Omega_k} s(x)^2\,dx$. Second, the mean $f$ is not a constant function, thus its contribution
to $\sigma_k^2$ is $\frac{1}{w_k}\int_{\Omega_k}\big(f(x) - \frac{1}{w_k}\int_{\Omega_k} f(u)\,du\big)^2 dx$. Note that when the size of $\Omega_k$ goes
to 0, this latter contribution vanishes, and the optimal allocation is thus proportional to
$\sqrt{w_k\int_{\Omega_k} s(x)^2\,dx} + o(1) = \int_{\Omega_k} s(x)\,dx + o(1)$. This means that for small strata, the variations
in the mean are negligible when compared to the variation due to the noise.
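The behaviour described in this section can be illustrated numerically in dimension one. The sketch below approximates $\Sigma_{\mathcal{N}_K} = \sum_k w_k \sigma_k$ through the variance decomposition of Equation 7, using a midpoint quadrature inside each stratum. The test functions $f(x) = \sin(2\pi x)$ and $s(x) = 0.5 + x$, the helper name `sigma_partition` and the quadrature resolution `m` are all illustrative assumptions.

```python
import math


def sigma_partition(f, s, K, m=1000):
    """Approximate Sigma_{N_K} = sum_k w_k * sigma_k on [0, 1] with K equal strata.

    sigma_k^2 is the sum of the variation of f inside the stratum and of the
    average of s^2 over the stratum (Equation 7), both computed on m midpoints.
    """
    w = 1.0 / K
    total = 0.0
    for k in range(K):
        xs = [(k + (i + 0.5) / m) * w for i in range(m)]  # midpoints of stratum k
        fx = [f(x) for x in xs]
        mean_f = sum(fx) / m
        var_f = sum((v - mean_f) ** 2 for v in fx) / m    # contribution of the mean f
        noise = sum(s(x) ** 2 for x in xs) / m            # contribution of the noise
        total += w * math.sqrt(var_f + noise)
    return total


def f(x):
    return math.sin(2.0 * math.pi * x)


def s(x):
    return 0.5 + x  # the integral of s over [0, 1] equals 1


values = {K: sigma_partition(f, s, K) for K in (1, 2, 4, 8, 16)}
for K, v in values.items():
    print(K, round(v, 4))
```

On this example the computed values decrease as the partition is refined and approach $\int_0^1 s(x)\,dx = 1$ from above, in line with Proposition 1 and with the remark that the contribution of the mean vanishes for small strata.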

4. Algorithm MC-UCB and a matching lower bound 
4.1. Algorithm MC - UCB 

In this Subsection, we describe a slight modification of the algorithm MC-UCB introduced
in (Carpentier and Munos, 2011a). The only difference is that we change the form of the

5. Note that in this noisy setting, if the diameter of the strata does not go to 0 on non-homogeneous
parts of $f$ and $s$, then the standard deviation corresponding to the allocation is larger than $\int_{[0,1]^d} s(u)\,du$.
high-probability upper confidence bound on the standard deviations, in order to improve
the elegance of the proofs, and we refine their analysis. The algorithm takes as input two
parameters $b$ and $f_{\max}$, which are linked to the distribution of the arms, $\delta$, which is a (small)
probability, and the partition $\mathcal{N}_K$. We recall the algorithm MC-UCB in Figure 1.



Input: $b$, $f_{\max}$, $\delta$, $\mathcal{N}_K$, set $A = 2\sqrt{(1 + 3b + 4f_{\max}^2)\log(2nK/\delta)}$
Initialize: Sample 2 states in each stratum.
for $t = 2K+1, \ldots, n$ do
    Compute $B_{k,t} = \frac{w_k}{T_{k,t-1}}\Big(\hat\sigma_{k,t-1} + A\sqrt{\frac{1}{T_{k,t-1}}}\Big)$ for each stratum $k \le K$
    Sample a point in stratum $k_t \in \arg\max_{1 \le k \le K} B_{k,t}$
end for
Output: $\hat\mu_n = \sum_{k=1}^{K} w_k\hat\mu_{k,n}$

Figure 1: The pseudo-code of the MC-UCB algorithm. The empirical standard deviations
and means $\hat\sigma_{k,t}$ and $\hat\mu_{k,t}$ are computed using Equations 9 and 10.



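The estimates $\hat\sigma_{k,t-1}^2$ and $\hat\mu_{k,t-1}$ are computed according to

$$\hat\sigma_{k,t-1}^2 = \frac{1}{T_{k,t-1}}\sum_{i=1}^{T_{k,t-1}}(X_{k,i} - \hat\mu_{k,t-1})^2, \tag{9}$$

and

$$\hat\mu_{k,t-1} = \frac{1}{T_{k,t-1}}\sum_{i=1}^{T_{k,t-1}} X_{k,i}. \tag{10}$$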
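The loop of Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the callback `sample(k, rng)`, the test integrand, the noise levels and the value `A=1.0` are hypothetical (in particular, `A` is treated as a free exploration constant rather than set from $b$, $f_{\max}$ and $\delta$), and the empirical variance is normalized by the number of samples.

```python
import math
import random


def mc_ucb(sample, weights, n, A, rng=None):
    """Sketch of the MC-UCB loop of Figure 1.

    sample(k, rng) is a hypothetical callback returning one noisy observation
    drawn uniformly from stratum k; A is the exploration constant.
    """
    rng = rng or random.Random()
    K = len(weights)
    counts = [0] * K   # number of samples per stratum
    sums = [0.0] * K   # running sum of observations per stratum
    sqs = [0.0] * K    # running sum of squared observations per stratum

    def pull(k):
        x = sample(k, rng)
        counts[k] += 1
        sums[k] += x
        sqs[k] += x * x

    for k in range(K):  # initialization: 2 samples in each stratum
        pull(k)
        pull(k)
    for _ in range(2 * K, n):
        best_k, best_b = 0, -math.inf
        for k in range(K):
            t_k = counts[k]
            mean = sums[k] / t_k
            var = max(sqs[k] / t_k - mean * mean, 0.0)  # empirical variance
            b = weights[k] / t_k * (math.sqrt(var) + A / math.sqrt(t_k))
            if b > best_b:
                best_k, best_b = k, b
        pull(best_k)
    estimate = sum(w * s / t for w, s, t in zip(weights, sums, counts))
    return estimate, counts


# Illustrative problem (an assumption): f(x) = x^2 on [0, 1], K = 4 equal strata,
# Gaussian noise that is much larger on the last stratum.
NOISE = [0.05, 0.05, 0.05, 2.0]


def sample(k, rng):
    x = (k + rng.random()) / 4.0
    return x * x + NOISE[k] * rng.gauss(0.0, 1.0)


estimate, counts = mc_ucb(sample, [0.25] * 4, n=4000, A=1.0, rng=random.Random(0))
print(estimate, counts)
```

In this toy run most of the budget goes to the noisy stratum, which is the qualitative behaviour expected from an allocation that tracks $w_k\sigma_k$; the running sums keep each step in $O(K)$ time.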

4.2. Upper bound on the pseudo-regret of algorithm MC-UCB. 

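We first state the following Assumption on the noise $\epsilon_t$.

Assumption 2 There exists $b > 0$ such that $\forall x \in [0,1]^d$, $\forall t$, and $\forall \lambda < \frac{1}{b}$,

$$\mathbb{E}_{\nu_x}\big[\exp(\lambda\epsilon_t)\big] \le \exp\Big(\frac{\lambda^2}{2(1 - \lambda b)}\Big), \quad\text{and}\quad \mathbb{E}_{\nu_x}\big[\exp(\lambda\epsilon_t^2 - \lambda)\big] \le \exp\Big(\frac{\lambda^2}{2(1 - \lambda b)}\Big).$$

This is a kind of sub-Gaussian assumption, satisfied for e.g. Gaussian as well as bounded
distributions. We also state an assumption on $f$ and $s$.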

Assumption 3 The functions $f$ and $s$ are bounded by $f_{\max}$.

Note that since the functions $f$ and $s$ are defined on $[0,1]^d$, if Assumption 1 is satisfied,
then Assumption 3 holds with $f_{\max} = \max(f(0), s(0)) + M d^{\alpha/2}$. We now prove the following
bound on the pseudo-regret. Note that we state it on partitions $\mathcal{N}_K$, but that it in fact
holds for any partition in K strata.
Proposition 4 Under Assumptions 2 and 3, on partition $\mathcal{N}_K$, when $n > 4K$, we have

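$$\mathbb{E}[R_{n,\mathcal{N}_K}(\mathcal{A}_{MC-UCB})] \le 24\,\Sigma_{\mathcal{N}_K}\sqrt{1 + 3b + 4f_{\max}^2}\;\cdots\;\frac{K^{1/3}}{n^{4/3}}\sqrt{\log(nK)} + \cdots$$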

The proof, given in the Supplementary Material (Appendix A), is close to the one of
MC-UCB in (Carpentier and Munos, 2011a), but an improved analysis leads to a better
dependency in terms of the number of strata K. We recall that in (Carpentier and
Munos, 2011a), the bound is of order $\tilde{O}(K n^{-4/3})$. This improvement is crucial here since
the larger K is, the closer $\Sigma_{\mathcal{N}_K}$ is to $\int_{[0,1]^d} s(x)\,dx$. The next Subsection states that the
rate $K^{1/3}\tilde{O}(n^{-4/3})$ of MC-UCB is optimal both in terms of K and n.
4.3. Lower Bound 

We now study the minimax rate for the pseudo-regret of any algorithm on a given partition
$\mathcal{N}_K$. Note that we state it for partitions $\mathcal{N}_K$, but that it holds for any partition in K strata
of equal measure.

Theorem 5 Let $K \in \mathbb{N}$. Let inf be the infimum taken over all online stratified sampling
algorithms on $\mathcal{N}_K$ and sup represent the supremum taken over all environments, then:

$$\inf \sup \mathbb{E}[R_{n,\mathcal{N}_K}] \ge C\,\frac{K^{1/3}}{n^{4/3}},$$

where $C$ is a numerical constant.

Proof [Sketch of proof (the full proof is reported in Appendix D)] We consider a partition
with 2K strata. On the K first strata, the samples are drawn from Bernoulli distributions
of parameter $\mu_k$ where $\mu_k \in \{\mu/2, \mu, 3\mu/2\}$, and on the K last strata, the samples are drawn
from a Bernoulli of parameter 1/2. We write $\sigma = \sqrt{\mu(1-\mu)}$ the standard deviation of
a Bernoulli of parameter $\mu$. We index by $\upsilon$ a set of $2^K$ possible environments, where
$\upsilon = (\upsilon_1, \ldots, \upsilon_K) \in \{-1, +1\}^K$, and the K first strata are defined by $\mu_k = \mu + \upsilon_k\frac{\mu}{2}$. Write
$\mathbb{P}_\upsilon$ the probability under such an environment, and $\mathbb{P}_\sigma$ the probability under which
all the K first strata are Bernoulli with mean $\mu$.

We define $\Omega_\upsilon$ the event on which there are fewer than $K/3$ arms not pulled correctly for
environment $\upsilon$ (i.e. for which $T_{k,n}$ is larger than the optimal allocation corresponding to
$\mu$ when actually $\mu_k = \mu/2$, or smaller than the optimal allocation corresponding to $\mu$ when
$\mu_k = 3\mu/2$). See Appendix D for a precise definition of these events. Then, the idea
is that there are so many such environments that any algorithm will be such that for at
least one of them we have $\mathbb{P}_\sigma(\Omega_\upsilon) \le \exp(-K/72)$. Then we derive by a variant of Pinsker's
inequality applied to an event of small probability that $\mathbb{P}_\upsilon(\Omega_\upsilon) \le \frac{KL(\mathbb{P}_\sigma, \mathbb{P}_\upsilon)}{K} = O\big(\frac{\sigma^3 n}{K}\big)$.
Finally, by choosing $\sigma$ of order $(K/n)^{1/3}$, we have that $\mathbb{P}_\upsilon(\Omega_\upsilon^c)$ is bigger than a constant, and
on $\Omega_\upsilon^c$ we know that there are more than $K/3$ arms not pulled correctly. This leads to an
expected pseudo-regret in environment $\upsilon$ of order $\Omega\big(\frac{K^{1/3}}{n^{4/3}}\big)$. ■

This is the first lower bound for the problem of online stratified sampling for Monte-Carlo.
Note that this bound is of the same order as the upper bound for the pseudo-regret of
algorithm MC-UCB. It means that this algorithm is, up to a constant, minimax optimal,
both in terms of the number of samples and in terms of the number of strata. It however
holds only on the partitions $\mathcal{N}_K$ (we conjecture that a similar result holds for any measurable
partition $\mathcal{N}$, but with a bound of order $\Omega\Big(\sum_{x \in \mathcal{N}}\frac{w_x^{2/3}}{n^{4/3}}\Big)$).



5. Best trade-off between $Q_{n,\mathcal{N}_K}$ and $R_{n,\mathcal{N}_K}(\mathcal{A}_{MC-UCB})$
5.1. Best trade-off

We consider in this Section the hyper-cubic partitions $\mathcal{N}_K$ as defined in Definition 2, and
we want to find the best number of strata $K_n$ as a function of n. Using the results in
Section 3 and Subsection 4.2, it is possible to deduce an optimal number of strata K to give
as parameter to algorithm MC-UCB. Since the performance of the algorithm
is defined as the sum of the quality of partition $\mathcal{N}_K$, i.e. $Q_{n,\mathcal{N}_K}$, and of the pseudo-regret of
the algorithm MC-UCB, namely $R_{n,\mathcal{N}_K}(\mathcal{A}_{MC-UCB})$, one wants to (i) on the one hand take
many strata so that $Q_{n,\mathcal{N}_K}$ is small but (ii) on the other hand, pay attention to the impact
this number of strata has on the pseudo-regret $R_{n,\mathcal{N}_K}(\mathcal{A}_{MC-UCB})$. A good way to do that
is to choose $K_n$ as a function of n such that $Q_{n,\mathcal{N}_{K_n}}$ and $R_{n,\mathcal{N}_{K_n}}(\mathcal{A}_{MC-UCB})$ are of the same
order.
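As a worked sketch of this balancing argument, one can equate the two orders stated in Sections 3 and 4, ignoring constants and logarithmic factors (the symbol $\asymp$ denotes equality of orders and is introduced here only for this illustration):

```latex
\frac{K^{-\alpha/d}}{n} \asymp \frac{K^{1/3}}{n^{4/3}}
\iff K^{\frac{1}{3}+\frac{\alpha}{d}} \asymp n^{\frac{1}{3}}
\iff K \asymp n^{\frac{d}{d+3\alpha}},
\qquad\text{so that}\qquad
\frac{K^{-\alpha/d}}{n} \asymp n^{-\frac{d+4\alpha}{d+3\alpha}}.
```

Both terms are then of order $n^{-\frac{d+4\alpha}{d+3\alpha}}$, which is the rate appearing in the bounds of this section.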

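Theorem 6 Under Assumptions 1 and 2 (since on $[0,1]^d$, Assumption 1 implies Assumption 3
for the value of $f_{\max}$ given after Assumption 3), choosing $K_n = \Big(\big\lfloor (n^{\frac{d}{d+3\alpha}})^{1/d} \big\rfloor\Big)^d$ ($\le n^{\frac{d}{d+3\alpha}} \le n$),
we have

$$\mathbb{E}[L_n(\mathcal{A}_{MC-UCB})] - \frac{1}{n}\Big(\int_{[0,1]^d} s(x)\,dx\Big)^2 \le c\,d^{\,\cdots}\sqrt{\log(n)}\;n^{-\frac{d+4\alpha}{d+3\alpha}}\Big(1 + \cdots\,n^{-\frac{\alpha}{d+3\alpha}}\Big),$$

where $c = 70(1 + M)\,\Sigma_{\mathcal{N}_1}\sqrt{1 + 3b + 4(f(0) + s(0) + M)^2}\,\big((f(0) + s(0) + M) + 4\,\cdots\big)^{1/3}$.
If $d \ll n$, this leads to the simplified bound

$$\mathbb{E}[L_n(\mathcal{A}_{MC-UCB})] - \frac{1}{n}\Big(\int_{[0,1]^d} s(x)\,dx\Big)^2 = \tilde{O}\Big(n^{-\frac{d+4\alpha}{d+3\alpha}}\Big).$$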

Proof [Proof of Theorem 6] The definition of $K_n$ implies that
$K_n \ge \big(n^{\frac{1}{d+3\alpha}} - 1\big)^d \ge n^{\frac{d}{d+3\alpha}}\Big(1 - \frac{1}{n^{\frac{1}{d+3\alpha}}}\Big)^d$. Also, trivially, $K_n \le n^{\frac{d}{d+3\alpha}}$. By plugging these lower and upper
bounds into respectively $Q_{n,\mathcal{N}_{K_n}}$ and $R_{n,\mathcal{N}_{K_n}}$, we obtain the final bound. ■
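The choice of $K_n$ in Theorem 6 is easy to compute in practice. The following sketch (the function name `num_strata` is a hypothetical helper) returns the number of hyper-cubic strata for a given budget, dimension and Hölder exponent, using the fact that $(n^{\frac{d}{d+3\alpha}})^{1/d} = n^{\frac{1}{d+3\alpha}}$.

```python
import math


def num_strata(n, d, alpha):
    """K_n = floor(n^(1 / (d + 3 * alpha)))^d, the hyper-cubic choice of Theorem 6.

    The floor is taken on a floating-point power, so budgets n for which
    n^(1 / (d + 3 * alpha)) is an exact integer may be rounded either way.
    """
    side = math.floor(n ** (1.0 / (d + 3.0 * alpha)))  # number of strata per axis
    return max(side, 1) ** d


print(num_strata(10 ** 6, d=2, alpha=1.0))   # 15 strata per axis, 225 in total
print(num_strata(1000, d=1, alpha=0.5))      # 15 strata
```

By construction the returned value is a perfect $d$-th power, as required by Definition 2, and never exceeds $n^{\frac{d}{d+3\alpha}}$.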
We can also prove a matching minimax lower bound using the results in Theorem 5. 

Theorem 7 Let sup represent the supremum taken over all $\alpha$-Hölder functions and inf
be the infimum taken over all algorithms that partition the space in convex strata of the same
shape, then the following holds true:

$$\inf \sup \mathbb{E}[L_n(\mathcal{A})] - \frac{1}{n}\Big(\int_{[0,1]^d} s(x)\,dx\Big)^2 = \Omega\Big(n^{-\frac{d+4\alpha}{d+3\alpha}}\Big).$$
Proof [Proof of Theorem 7] This is a direct consequence of Theorem 5 and the second
comment of Subsection 3.1. ■



5.2. Discussion 

Optimal pseudo-risk. The dominant term in the pseudo-risk of MC-UCB with a proper
number of strata is $\frac{\inf_{\mathcal{N}}\Sigma_{\mathcal{N}}^2}{n} = \frac{1}{n}\Big(\int_{[0,1]^d} s(x)\,dx\Big)^2$ (the other term is negligible). This means
that algorithm MC-UCB is almost as efficient as the optimal oracle strategy on the optimal
oracle partition. In comparison, the variance of the estimate given by crude Monte-Carlo is
$\frac{1}{n}\Big(\int_{[0,1]^d}\big(f(x) - \int_{[0,1]^d} f(u)\,du\big)^2 dx + \int_{[0,1]^d} s(x)^2\,dx\Big)$. Thus MC-UCB makes the term
coming from the variations in the mean vanish, and decreases the noise term (since by
Cauchy-Schwarz, $\big(\int_{[0,1]^d} s(x)\,dx\big)^2 \le \int_{[0,1]^d} s(x)^2\,dx$).

Minimax-optimal trade-off for algorithm MC-UCB. The optimal trade-off on the
number of strata $K_n$, of order $n^{\frac{d}{d+3\alpha}}$, depends on the dimension and the smoothness of the
function. The higher the dimension, the more strata are needed in order to have a decent
speed of convergence for $\Sigma_{\mathcal{N}_K}$. The smoother the function, the fewer strata are needed.
It is yet important to remark that this trade-off is not exact. We provide an almost minimax-optimal
order of magnitude for $K_n$, in terms of n, so that the rate of convergence of the
algorithm is minimax-optimal up to a $\sqrt{\log(n)}$ factor.

Link between risk and pseudo-risk. It is important to compare the pseudo-risk
$L_n(\mathcal{A}) = \sum_{k=1}^{K}\frac{w_k^2\sigma_k^2}{T_{k,n}}$ and the true risk $\mathbb{E}[(\hat\mu_n - \mu)^2]$. Note that those quantities are in general not
equal for an algorithm $\mathcal{A}$ that allocates the samples in a dynamic way: indeed, the quantities
$T_{k,n}$ are in that case stopping times and the variance of the estimate $\hat\mu_n$ is not equal to the
pseudo-risk. However, in (Carpentier and Munos, 2011b), the authors highlighted
for MC-UCB some links between the risk and the pseudo-risk. More precisely, they established
links between $L_n(\mathcal{A})$ and $\sum_{k=1}^{K} w_k^2\,\mathbb{E}[(\hat\mu_{k,n} - \mu_k)^2]$. This step is possible since
$\mathbb{E}[(\hat\mu_{k,n} - \mu_k)^2] \le \frac{\sigma_k^2}{\underline{T}_{k,n}^2}\,\mathbb{E}[T_{k,n}]$, where $\underline{T}_{k,n}$ is a lower bound on the number of pulls $T_{k,n}$ on a
high probability event. Then they bounded the cross products $\mathbb{E}[(\hat\mu_{k,n} - \mu_k)(\hat\mu_{p,n} - \mu_p)]$ and
provided some upper bounds on those terms. A tight analysis of these terms as a function
of the number of strata K remains to be investigated.

Knowledge of the Hölder exponent. In order to choose the number of strata properly and achieve the rate in Theorem 6, one needs a proper lower bound on the Hölder exponent of the function: indeed, the rougher the function is, the more strata are required. On the other hand, such knowledge of the function is not always available, and an interesting question is whether it is possible to estimate this exponent fast enough. There are interesting papers on that subject, like (Hoffmann and Lepski, 2002), where the authors tackle the problem of regression and prove that it is possible, up to a certain extent, to adapt to the unknown smoothness of the function. The authors of (Giné and Nickl, 2010) add to that (in the case of density estimation) and prove that, under the assumption that the function attains its Hölder exponent, it is even possible to have a proper estimation of this exponent and thus adaptive confidence bands. An idea would be to try to adapt those results to the finite-sample case.

MC-UCB on a noiseless function. Consider the case where s = 0 almost surely, i.e. the samples collected are noiseless. Proposition 1 ensures that $\inf_{\mathcal{N}} \Sigma_{\mathcal{N}} = 0$: it is thus possible in this case to achieve a pseudo-risk that has a faster rate than $O(\frac{1}{n})$. If the function f is smooth, e.g. Hölder with a not too low exponent α, it is efficient to use low discrepancy methods to integrate the function. An idea is to stratify the domain in n hyper-rectangular strata of minimal diameter, and to pick at random one sample per stratum. The variance of the resulting estimate is of order $O\big(\frac{1}{n^{1+2\alpha/d}}\big)$. Algorithm MC-UCB is not as efficient as a low discrepancy scheme: it needs a number of strata K < n in order to be able to estimate the variance of each stratum. Its pseudo-risk is then of order $O\big(\frac{1}{nK^{2\alpha/d}}\big)$.
This is however only true when the observations are noiseless. Otherwise, the variance of the estimate is of order 1/n, no matter what strategy the learner chooses.
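The "one sample per stratum" idea for noiseless evaluations can be sketched in one dimension as follows. The test function, the seed and the helper names are assumptions chosen for illustration; the sketch only contrasts this scheme with crude Monte-Carlo and does not implement MC-UCB.

```python
import math
import random

def one_sample_per_stratum(f, n, rng):
    """Estimate the integral of f over [0, 1] with n equal-width strata,
    drawing one uniform point in each stratum (noiseless evaluations)."""
    return sum(f((k + rng.random()) / n) for k in range(n)) / n

def crude_mc(f, n, rng):
    """Plain Monte-Carlo estimate with n uniform points."""
    return sum(f(rng.random()) for _ in range(n)) / n

def mse(estimator, f, n, truth, runs, rng):
    """Empirical mean squared error of an estimator over independent runs."""
    return sum((estimator(f, n, rng) - truth) ** 2 for _ in range(runs)) / runs

def smooth(x):
    """Smooth toy function whose integral over [0, 1] is 1/2."""
    return math.sin(2 * math.pi * x) + x

rng = random.Random(0)
for n in (10, 100):
    print(n,
          mse(crude_mc, smooth, n, 0.5, 500, rng),
          mse(one_sample_per_stratum, smooth, n, 0.5, 500, rng))
```

For a Lipschitz function in dimension one (α = 1, d = 1), the stratified error is expected to decay roughly like $n^{-3}$, against $n^{-1}$ for crude Monte-Carlo.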

In high dimension. The first bound in Theorem 6 expresses precisely how the performance of the estimate output by MC-UCB depends on d. It states that the quantity $L_n(\mathcal{A}) - \frac{1}{n}\big(\int_{[0,1]^d} s(x)\,dx\big)^2$ is negligible when compared to 1/n when n is exponential in d. This is not surprising since our technique aims at stratifying equally in every direction. It is not possible to stratify in every direction of the domain if the function lies in a very high dimensional domain.

This is however not a reason for not using our algorithm in high dimension. Indeed, stratifying even in a small number of strata already reduces the variance, and in high dimension, any variance reduction technique is welcome. As mentioned at the end of Section 2, the model that we propose for the function is suitable for modeling $d^*$-dimensional functions that we only stratify in $d < d^*$ directions (and $d \ll n$). A reasonable trade-off for d can also be inferred from the bound, but we believe that what a good choice of d is depends a lot on the problem. We then believe that it is a good idea to select the number of strata in the minimax way that we propose. Again, having a very high dimensional function that one stratifies in only a few directions is a very common technique in financial mathematics, for pricing options (practitioners stratify an infinite dimensional process in only 1 to 5 carefully chosen dimensions).

6. Numerical experiment: influence of the number of strata in the 
Pricing of an Asian option 

We consider the pricing problem of an Asian option introduced in (Glasserman et al., 1999) and later considered in (Kawai, 2010; Etore and Jourdain, 2010). This uses a Black-Scholes model with strike C and maturity T. Let $(W(t))_{0 \le t \le T}$ be a Brownian motion. The discounted payoff of the Asian option is defined as a function of W, by:



where $S_0$, $r$, and $s_0$ are constants, and the price is defined by the expectation $p = \mathbb{E}_W F(W)$.

We want to estimate the price p by Monte-Carlo simulations (by sampling on W). In order to reduce the variance of the estimated price, we can stratify the space of W. Glasserman et al. (1999) suggest stratifying according to a one dimensional projection of W, i.e., choosing a time t and stratifying according to the quantiles of $W_t$ (and simulating the rest of the Brownian motion according to a Brownian bridge, see (Kawai, 2010)). They further argue that the best direction for stratification is to choose t = T, i.e., to stratify according to the last time T. This choice of stratification is also intuitive since $W_T$ has the highest variance, the biggest exponent in the payoff (11), and thus the highest volatility. Kawai (2010) and Etore and Jourdain (2010) also use the same direction of stratification. We stratify according to the quantiles of $W_T$, that is to say the quantiles of a normal distribution $\mathcal{N}(0, T)$. When stratifying in K strata, we stratify according to the 1/K-th quantiles (so that the strata are hyper-cubes of same measure).
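A minimal sketch of this stratification is given below: the terminal value is drawn inside one of K equiprobable quantile strata of $\mathcal{N}(0, T)$, and the intermediate values are then filled in with a Brownian bridge. The function names, the clamping constant and the sequential bridge construction are assumptions made for illustration; they are not taken from the cited papers.

```python
import math
import random
from statistics import NormalDist

def sample_terminal_in_stratum(k, K, T, rng):
    """Draw W_T ~ N(0, T) conditioned on lying in the k-th of K
    equiprobable quantile strata (k = 0, ..., K - 1)."""
    u = (k + rng.random()) / K             # uniform on [k/K, (k+1)/K)
    u = min(max(u, 1e-12), 1.0 - 1e-12)    # keep inv_cdf away from 0 and 1
    return NormalDist(0.0, math.sqrt(T)).inv_cdf(u)

def brownian_bridge(w_T, T, d, rng):
    """Return W at times T/d, 2T/d, ..., T given W_0 = 0 and W_T = w_T."""
    dt = T / d
    path, w, t = [], 0.0, 0.0
    for _ in range(d - 1):
        remaining = T - t
        # Conditional law of W_{t+dt} given W_t = w and W_T = w_T.
        mean = w + (dt / remaining) * (w_T - w)
        var = dt * (remaining - dt) / remaining
        w = rng.gauss(mean, math.sqrt(var))
        t += dt
        path.append(w)
    path.append(w_T)
    return path

rng = random.Random(0)
w_T = sample_terminal_in_stratum(k=3, K=10, T=1.0, rng=rng)
print(brownian_bridge(w_T, T=1.0, d=16, rng=rng))
```

Sampling the stratum index uniformly and then calling these two functions reproduces the law of the discretized Brownian motion, since the strata have equal probability.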

We choose the same numerical values as Kawai (2010): $S_0 = 100$, $r = 0.05$, $s_0 = 0.30$, T = 1 and d = 16. As in Kawai (2010), we also discretize the Brownian motion in 16 equidistant times, so that we are able to simulate it. We choose C = 120.
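For readers who want to reproduce the setting, the sketch below evaluates a discounted payoff on a discretized path with the numerical values above. The payoff formula (11) is not legible in this copy, so the code assumes the standard arithmetic-average Asian call under Black-Scholes, with the time average replaced by a plain mean over the d grid points; both the formula and the function name `asian_call_payoff` are assumptions.

```python
import math

def asian_call_payoff(path, S0=100.0, r=0.05, s0=0.30, T=1.0, C=120.0):
    """Discounted payoff of an arithmetic-average Asian call (assumed form).

    `path` holds the Brownian motion at the d equidistant times
    T/d, 2T/d, ..., T."""
    d = len(path)
    dt = T / d
    # Black-Scholes asset price at each grid time.
    prices = [S0 * math.exp((r - 0.5 * s0 ** 2) * (i + 1) * dt + s0 * w)
              for i, w in enumerate(path)]
    average = sum(prices) / d
    return math.exp(-r * T) * max(average - C, 0.0)

print(asian_call_payoff([0.0] * 16))   # flat path: average stays below the strike
print(asian_call_payoff([1.0] * 16))   # high path: strictly positive payoff
```

With a strike of 120 against an initial price of 100, most paths give a zero payoff, which is why stratifying along the terminal value is effective here.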

In this paper, we only run experiments for MC-UCB, and exhibit the influence of the number of strata. For a comparison between MC-UCB and other algorithms, see (Carpentier and Munos, 2011a). By studying the range of F(W), we set the parameter of the algorithm MC-UCB to A = 150 log(n).

For n = 200 and n = 2000, we observe the influence of the number of strata in Figure 2. We observe the trade-off that we mentioned between pseudo-regret and quality, in the sense that the mean squared error of the estimate output by MC-UCB (when compared to the true integral of f) first decreases with K and then increases. Note that, unsurprisingly, for a large n the minimum of the mean squared error is reached with more strata. Finally, note that our technique is never outperformed by uniform stratified Monte-Carlo: it is a good idea to try to adapt.

7. Conclusion 

In this paper we studied the problem of online stratified sampling for the numerical integration of a function given noisy evaluations, and more precisely we discussed the problem of choosing the minimax-optimal number of strata.

We explained why, to our minds, this is a crucial problem when one wants to design an efficient algorithm. We highlighted the fact that there is a trade-off between having many strata (and a good approximation error, called the quality of a partition), and not too many, in order to perform almost as well as the optimal oracle allocation on a given partition (small estimation error, called pseudo-regret).

When the function is noisy, the noise is the dominant quantity in the optimal oracle variance on the optimal oracle partition. Indeed, decreasing the size of the strata does not diminish the (local) variance of the noise. In this case, the pseudo-risk of algorithm MC-UCB is equal, up to negligible terms, to the mean squared error of the estimate output by the optimal oracle strategy on the best (oracle) partition, at a rate of $O\big(n^{-\frac{d+4\alpha}{d+3\alpha}}\big)$, where α is the Hölder exponent of s and f. This rate is minimax optimal on the class of α-Hölder functions: it is not possible, up to a constant factor, to do better simultaneously on all α-Hölder functions.

We believe that there are (at least) three very interesting remaining open questions: 
[Figure 2, two panels. Left panel: "Mean squared error in function of the number of strata for n = 200". Right panel: "Mean squared error in function of the number of strata for n = 2000". Horizontal axis: number of strata. Curves: Crude MC, Uniform Stratified MC, MC-UCB.]

Figure 2: Mean squared error for uniform stratified sampling for different numbers of strata, for (Left:) n=200 and (Right:) n=2000.


• The first one is to investigate whether it is possible to estimate online the Hölder exponent fast enough. Indeed, one needs it in order to compute the proper number of strata for MC-UCB, and the lower bound on the Hölder exponent appears in the bound. It is thus a crucial parameter.

• The second direction is to build a more efficient algorithm in the noiseless case. We remarked that MC-UCB is not as efficient in this case as a simple non-adaptive method. The problem comes from the fact that in the case of a noiseless function, it is important to sample the space in a way that ensures that the points are as spread out as possible. An interesting problem is thus to build an algorithm that mixes ideas from quasi Monte-Carlo and ideas from online stratified Monte-Carlo.

• Another question is the relevance of fixing the strata in advance. Although it is minimax-optimal on the class of α-Hölder functions to have hyper-cubic strata of same measure, it might in some cases be more interesting to focus and stratify more finely at places where the function is rough. From that perspective, it could be more clever to have an adaptive procedure that also decides where to refine the strata.
Appendix A. Proof of Theorem 10

A.1. The main tool: a high probability bound on the standard deviations
Upper bound on the standard deviation:

Lemma 8 Let Assumption 2 hold and $n \ge 2$. Define the following event



fl { \ tin: Z! -£Z}*fcj 

l<k<K, 2<t<n ^ \ i=l j=l 



< A 




(12) 



where $A = 2\sqrt{(1 + 3b + 4V)\log(2nK/\delta)}$. Then $\Pr(\xi) \ge 1 - \delta$.

Note that the first term in the absolute value in Equation 12 is the empirical standard deviation of arm k computed as in Equation 9 for t samples. The event $\xi$ plays an important role in the proofs of this section and a number of statements will be proved on this event.
Proof Under Assumption 2 we have for /^ ax > max^ o\ with probability 1 — 5 because of 
the results of Lemma 15 



1 1 



0~k 



< 2 



:i + 36 + 4/Sax)log(2/<y) 



(13) 



Then by doing a simple union bound on (k, t), we obtain the result. ■

We deduce the following corollary when the numbers of samples $T_{k,t}$ are random.

Corollary 9 For any k = 1, ..., K and t = 2K, ..., n, let $X_{k,1}, \dots, X_{k,n}$ be n i.i.d. random variables drawn from $\nu_k$, satisfying Assumption 2. Let $T_{k,t}$ be any random variable taking values in {2, ..., n}. Let $\hat\sigma^2_{k,t}$ be the empirical variance computed from Equation 9. Then, on the event $\xi$, we have:

$$|\hat\sigma_{k,t} - \sigma_k| \le A\sqrt{\frac{1}{T_{k,t}}}, \qquad (14)$$

where $A = 2\sqrt{(1 + 3b + 4V)\log(2nK/\delta)}$.



A.2. Main Demonstration

We first state and prove the following Lemma and then use this result to prove Theorem 10.



Theorem 10 Let Assumption 2 hold. For any $0 < \delta \le 1$ and for $n \ge 4K$, the algorithm MC-UCB launched on a partition $\mathcal{N}_K$ satisfies



7i 
Proof

Step 1. Lower bound of order $O(n^{2/3})$. Let k be the index of an arm such that $T_{k,n} \ge n/K$ (this implies $T_{k,n} \ge 3$ as $n \ge 4K$, and arm k is thus pulled after the initialization) and let $t + 1 \le n$ be the last time at which it was pulled (see footnote 6), i.e., $T_{k,t} = T_{k,n} - 1$ and $T_{k,t+1} = T_{k,n}$. From Equation 14 and the fact that $T_{k,n} \ge n/K$, we obtain on $\xi$



S*,t < ^1 <T fc + 2AJ-^- I < ^ (15) 





n 

where the second inequality follows from the facts that Tk,t > 1, WkOk < £/V K , and lUfc < 
Sfc wjfc = 1- Since at time t + 1 the arm A: has been pulled, then for any arm q, we have 

B q ,t < B Kt . (16) 

From the definition of B q j, and also using the fact that T q< t < T qjTl , we deduce on £ that 



rrv-*/ f m 1 



Combining Equations 15-17, we obtain on £ 



< 



Finally, this implies on £ that for any q because = w q , 

( 2 A n\2/3 

/ \2/3 / n2/3 

This implies that \fq, T q , n > C { § J where C = maXfcfTfc+2A J . 
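Step 2. Properties of the algorithm. We first recall the definition of $B_{q,t+1}$ used in the MC-UCB algorithm:

$$B_{q,t+1} = \frac{w_q}{T_{q,t}}\Big(\hat\sigma_{q,t} + A\sqrt{\frac{1}{T_{q,t}}}\Big).$$

Using Corollary 9 it follows that, on $\xi$,

$$\frac{w_q \sigma_q}{T_{q,t}} \le B_{q,t+1} \le \frac{w_q}{T_{q,t}}\Big(\sigma_q + 2A\sqrt{\frac{1}{T_{q,t}}}\Big). \qquad (19)$$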
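To fix ideas, an index rule of the form $B_{q,t+1} = \frac{w_q}{T_{q,t}}\big(\hat\sigma_{q,t} + A\sqrt{1/T_{q,t}}\big)$ can be turned into the following allocation loop. This is a sketch under stated assumptions rather than the authors' implementation: each stratum is pulled twice at initialization, the empirical standard deviation is taken with `statistics.pstdev` (Equation 9 is not visible in this excerpt), and the value of `A` and the two toy strata are purely illustrative.

```python
import math
import random
import statistics

def mc_ucb(sample, weights, n, A, rng):
    """Allocate n samples over K = len(weights) strata by always pulling
    the stratum with the largest index
    B_q = w_q / T_q * (sigma_hat_q + A * sqrt(1 / T_q)).

    `sample(k, rng)` returns one noisy observation from stratum k."""
    K = len(weights)
    # Initialization: two pulls per stratum so that a deviation is defined.
    obs = [[sample(k, rng), sample(k, rng)] for k in range(K)]

    def index(q):
        T = len(obs[q])
        sigma_hat = statistics.pstdev(obs[q])
        return weights[q] / T * (sigma_hat + A * math.sqrt(1.0 / T))

    for _ in range(n - 2 * K):
        q = max(range(K), key=index)
        obs[q].append(sample(q, rng))

    # Stratified estimate: weighted sum of the per-stratum empirical means.
    estimate = sum(w * statistics.fmean(o) for w, o in zip(weights, obs))
    return estimate, [len(o) for o in obs]

def noisy_stratum(k, rng):
    """Hypothetical strata: same mean 1.0, noise deviations 0.1 and 2.0."""
    return 1.0 + (0.1, 2.0)[k] * rng.gauss(0.0, 1.0)

estimate, counts = mc_ucb(noisy_stratum, [0.5, 0.5], n=400, A=1.0,
                          rng=random.Random(0))
print(estimate, counts)
```

The stratum with the larger standard deviation receives most of the budget, which is the behaviour the upper and lower bounds on the index are used to quantify in the proof.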

Let $t + 1 \ge 2K + 1$ be the time at which an arm q is pulled for the last time, that is $T_{q,t} = T_{q,n} - 1$. Note that there is at least one arm such that this happens as $n \ge 4K$. Since at t + 1 arm q is chosen, then for any other arm p, we have

$$B_{p,t+1} \le B_{q,t+1}. \qquad (20)$$


6. Note that such an arm always exists for any possible allocation strategy given the constraint $n = \sum_q T_{q,n}$.
From Equation 19 and $T_{q,t} = T_{q,n} - 1$, we obtain on $\xi$



Furthermore, since T Pi t < 3pn> then on £ 

B„,, +1 > » > ». (22) 
Combining Equations 20-22, we obtain on £ 



Summing over all g such that the previous Equation is verified, i.e. such that T q ^ n > 3, on 
both sides, we obtain on £ 



p'-'p 



p ' n q\T q ,n>3 g\T q ,„>3 V v ■'" 



a „ + 2A. 



This implies 



P^{n - 3K) < £ L + 2A,/-i— 



WpO, 



(23) 



3= 

Step 3. Lower bound. Plugging Equation 18 in Equation 23, 



WpCT 



^(n - 3K) < £ w q (a q + 2^Li— 
Pi« q V V 9,n , 



2V2A if 1 / 3 



< ^N K + 

on £, since 7g n — 1 > ^ (as T gi „ > 2). Finally as n > 4K, we obtain on £ the following 
bound 

w p a p Y, Nk 4V2AK 1 ^ 12KE Mk 

T p , n ~ n ^ n 4 /3 n 2 ' 1 J 

Step 4. Regret. By summing and using Equation 24 which holds for all p, we obtain on 
£ (with probability 1 — 5) 



2 ~ 2 £^ 4E A/ vv / 2AK 1 /3 12^S 2 



j ^p®p , 



A* 



T p , n ~ n n 4 /3 n 2 
This implies, since $\mathbb{E}L_n = \mathbb{E}[L_n \mathbb{I}\{\xi\}] + \mathbb{E}[L_n \mathbb{I}\{\xi^c\}]$ and since $\delta = n^{-2}$,

n sfc n*/ 3 n 2 ^ v p ' 



p 

2 

n ' y^C « 4/3 ' n 2 
Since 5 = n~ 2 , we have A < 6a/(1 + 36 + 5F) log(nif) and C > ( /ma 4 x+4 ) this leads 

to 



n \ 4 / n 4 ^ n z 



Appendix B. Proof of Proposition 1 

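Step 1: Expression of the variance of the stratified estimate. Recall that the samples are of the form $f(x) + s(x)\epsilon_t$, where $\epsilon_t \sim \nu_x$ with $\mathbb{E}_{\nu_x}[\epsilon_t] = 0$ and $\mathbb{V}_{\nu_x}[\epsilon_t] = 1$, and that the $\epsilon_t$ are independent. Writing $X_x(t)$ for a sample collected at point x and $\mu_k = \frac{1}{w_k}\int_{\Omega_k} f(u)\,du$, we have

$$\begin{aligned}
\sigma_k^2 &= \frac{1}{w_k}\int_{\Omega_k} \mathbb{E}_{\nu_x}\big[(X_x(t) - \mu_k)^2\big]\,dx \\
&= \frac{1}{w_k}\int_{\Omega_k} \mathbb{E}_{\nu_x}\Big[\Big(f(x) + s(x)\epsilon_t - \frac{1}{w_k}\int_{\Omega_k} f(u)\,du\Big)^2\Big]\,dx \\
&= \frac{1}{w_k}\int_{\Omega_k} \Big(f(x) - \frac{1}{w_k}\int_{\Omega_k} f(u)\,du\Big)^2 dx + \frac{1}{w_k}\int_{\Omega_k} s(x)^2\,\mathbb{E}_{\nu_x}[\epsilon_t^2]\,dx \\
&= \frac{1}{w_k}\int_{\Omega_k} \Big(f(x) - \frac{1}{w_k}\int_{\Omega_k} f(u)\,du\Big)^2 dx + \frac{1}{w_k}\int_{\Omega_k} s(x)^2\,dx,
\end{aligned}$$

where the cross term vanishes because $\mathbb{E}_{\nu_x}[\epsilon_t] = 0$.
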



Step 2: Proof for the uniformly continuous functions. We first prove the result for a subset of $L_2([0,1]^d)$, namely the set of functions f and s that are uniformly continuous.

Proposition 11 If the functions f and s are uniformly continuous and if the strata satisfy the Assumptions of Proposition 1, we have

$$\sum_k w_{k,n}\sigma_{k,n} - \int_{[0,1]^d} s(x)\,dx \to 0.$$


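Proof

Let $v > 0$. As s and f are uniformly continuous, we know that there exists $\eta > 0$ such that, for all x, $|s(x + u) - s(x)| \le v$ and $|f(x + u) - f(x)| \le v$ whenever $u \in B_{2,d}(\eta)$ (see footnote 7).

By Assumption AS1, we know that $w_{k,n} \le v_n$. Note that the diameter of stratum $\Omega_{k,n}$ is smaller than $D(w_{k,n}) \le D(v_n)$. Let us choose n big enough, i.e. such that $D(v_n) \le \eta$ and $v_n \le v$.

7. We denote by $B_{2,d}(\eta)$ the ball of center 0 and radius $\eta$ according to the $\|\cdot\|_2$ norm.

We have

$$\begin{aligned}
\sigma_{k,n}^2 - \Big(\frac{1}{w_{k,n}}\int_{\Omega_{k,n}} s\Big)^2
&= \frac{1}{w_{k,n}}\int_{\Omega_{k,n}}\Big(f - \frac{1}{w_{k,n}}\int_{\Omega_{k,n}} f\Big)^2 + \frac{1}{w_{k,n}}\int_{\Omega_{k,n}} s^2 - \Big(\frac{1}{w_{k,n}}\int_{\Omega_{k,n}} s\Big)^2 \\
&= \frac{1}{w_{k,n}}\int_{\Omega_{k,n}}\Big(f - \frac{1}{w_{k,n}}\int_{\Omega_{k,n}} f\Big)^2 + \frac{1}{w_{k,n}}\int_{\Omega_{k,n}}\Big(s - \frac{1}{w_{k,n}}\int_{\Omega_{k,n}} s\Big)^2 \\
&\le v^2 + v^2 \le 2v^2.
\end{aligned}$$

Because of concavity of the square-root function, we get

$$\sigma_{k,n} - \frac{1}{w_{k,n}}\int_{\Omega_{k,n}} s \le \sqrt{2}\,v.$$

By summing we get

$$\sum_k w_{k,n}\sigma_{k,n} - \int_{[0,1]^d} s \le \sqrt{2}\,v.$$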



Step 3: Density of uniformly continuous functions in $L_2([0,1]^d)$. We first recall a property of the functions in $L_2([0,1]^d)$.

Proposition 12 The uniformly continuous functions according to the $\|\cdot\|_2$ norm are dense in $L_2([0,1]^d)$.

Proof The result follows directly from the facts that

• The continuous functions are dense in $L_2(\Omega)$ (Stone-Weierstrass Theorem).

• The uniformly continuous functions on a compact space $\Omega$ according to the $\|\cdot\|_2$ norm are dense in the space of continuous functions.

• $[0,1]^d$ is compact.



This means that we can approximate with arbitrary precision, according to the $\|\cdot\|_2$ norm on $L_2([0,1]^d)$, any function in $L_2([0,1]^d)$ by a uniformly continuous function.
Using this proposition, we can prove the following Lemma.

Lemma 13 For a given n and a given v, there exist two uniformly continuous functions $f_v$ and $s_v$ such that:



K n K n 



~ ^ \/ w k,n\ J (fv(x)+ / f v (u)duj dx / si 

k=\ k=l V ^ Qk ' n ^ Qk ' n Wk ' n ^ Uk ' n 



(x)dx 



< V. 
Proof Let us fix n and v.

Let $f_v$ be a uniformly continuous function such that

(f(x) - f v (x)) 2 dx < mm(w kn )-, 

k Z 

and $s_v$ be a uniformly continuous function such that

(s(x) - s v (x)) 2 dx < mm(w kn )--. 

k A 

This is possible because $w_{k,n} > 0$ and because the uniformly continuous functions are dense in $L_2([0,1]^d)$ by Proposition 12.
Note that we thus have

(f(x) - f v {x)fdx < 



1 

w k,n Ju kn 

and 

1 



(s(x) - s v (x)) 2 dx < 

Wk,n jQ k>n * 

Note also that f (six) — s v (x)) 2 dx > —— f s(x) 2 dx — f s v (x) 2 dx 

Simple triangle inequality leads to 

— / {f{x) — / f{u)dufdx — [ (f v ( x )-— [ f v ( u )du) 2 dx 

Wk,nJVL kn w k,nJn kn Wk,nJn kn Wk,n JCl kn 



V 
< -. 

~ 2 



Now note that as u? = f (fix) — f f(u)du) 2 dx H — f n s(x) 2 dx, we 

know that the variance of the function on strata flk, n is arbitrarily close to the variance of 
its approximation. 
By convexity, one gets 



CT-fc.n-i/— — / f/v(a;) — / f v (u)du) dx-\ — I s 2 (x)dx 

V Wk,n Jn k n \ ^fc,n jQ fc n / Wk,n Ju- 



k.n 



< v. 



And finally, by summing 

K n K. 



Wk,nOk,n ~ yZ / (fv{x)+ I f v (u)du) dx — j s 2 v (x)dx 



< V. 



Step 4: Combination of all the preliminary results to finish the proof.

We now finish the proof of Proposition 1.

Let $v > 0$ and let $f_v$ and $s_v$ be as in Lemma 13.
We know that

Kn 

/ y Wk,n&k,n 



Finally, 



k=l 



^2vwk~^J / (fv(x) + / 



f v (u)du) dx 



1 



W k ,n JQ. 



s 2 (x)dx 



< v, 
and also that

(s(x) - s v (x)) 2 dx < mm(w kl n)7: < 



J 

Ju 



in fc ' 2 2 

Note that by Cauchy-Schwarz:



s(x) - s v (x)\dx <^J (s(x) - s v {x)) 2 dx < y ^. 
Note also that Proposition 11 tells us that 3n such that 
y2V w k,n\ [ (fv{x) — [ f v (u)du) dx+ I s 2 (x)dx- [ 

k=1 v jQk '" Wk ' n jQk > n jQk ' n J ^ 



[0,1] 



s v (x)dx < v. 



d 



When combining all those results, one gets the desired result. 

Note finally that if we choose the strata as being small boxes of size and side {j^) 1 ^-, 
then the assumptions of Proposition 1 is verified. 

Appendix C. Proof of Proposition 3 

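Note first that

$$\sigma_k^2 = \frac{1}{w_k}\int_{\Omega_k}\Big(f(x) - \frac{1}{w_k}\int_{\Omega_k} f(u)\,du\Big)^2 dx + \frac{1}{w_k}\int_{\Omega_k} s^2(x)\,dx.$$

The term in f. As the function f is (α, M)-Hölder, we know that $\forall (x, y) \in \Omega$, $|f(x) - f(y)| \le M\|x - y\|_2^{\alpha}$.

Using that we get

$$\frac{1}{w_k}\int_{\Omega_k}\Big(f(x) - \frac{1}{w_k}\int_{\Omega_k} f(u)\,du\Big)^2 dx \le M^2 D(\Omega_k)^{2\alpha} \le M^2 d\Big(\frac{1}{K}\Big)^{2\alpha/d}.$$

The term in s. As the function s is (α, M)-Hölder, we know that $\forall (x, y) \in \Omega$, $|s(x) - s(y)| \le M\|x - y\|_2^{\alpha}$. Hence

$$\frac{1}{w_k}\int_{\Omega_k} s^2(x)\,dx - \Big(\frac{1}{w_k}\int_{\Omega_k} s(u)\,du\Big)^2 = \frac{1}{w_k}\int_{\Omega_k}\Big(s(x) - \frac{1}{w_k}\int_{\Omega_k} s(u)\,du\Big)^2 dx \le M^2 D(\Omega_k)^{2\alpha} \le M^2 d\Big(\frac{1}{K}\Big)^{2\alpha/d}.$$

Finally... By combining those two results

$$w_k\sigma_k - \int_{\Omega_k} s(x)\,dx \le w_k\sqrt{M^2 d\Big(\frac{1}{K}\Big)^{2\alpha/d} + M^2 d\Big(\frac{1}{K}\Big)^{2\alpha/d}}.$$

By summing over all the strata, one obtains

$$\Sigma_{\mathcal{N}_K} - \int_{[0,1]^d} s(x)\,dx \le \sqrt{2d}\,M\Big(\frac{1}{K}\Big)^{\alpha/d}.$$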
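A bound of the form $\Sigma_{\mathcal{N}_K} - \int_{[0,1]^d} s(x)\,dx \le \sqrt{2d}\,M K^{-\alpha/d}$ can be sanity-checked numerically in dimension one. In the sketch below, the functions `f` and `s`, the constant `M` and the quadrature helper are illustrative assumptions; the strata are K equal-width intervals of [0, 1].

```python
import math

def midpoint(g, a, b, m=2000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / m
    return h * sum(g(a + (i + 0.5) * h) for i in range(m))

def sigma_partition(f, s, K):
    """Sum over k of w_k * sigma_k for K equal-width strata of [0, 1]."""
    total = 0.0
    w = 1.0 / K
    for k in range(K):
        a, b = k * w, (k + 1) * w
        mean_f = midpoint(f, a, b) / w
        # sigma_k^2 = variation of f inside the stratum + mean of s^2.
        var_f = midpoint(lambda x: (f(x) - mean_f) ** 2, a, b) / w
        mean_s2 = midpoint(lambda x: s(x) ** 2, a, b) / w
        total += w * math.sqrt(var_f + mean_s2)
    return total

def f(x):
    """Lipschitz with constant 2*pi (so alpha = 1, M = 2*pi)."""
    return math.sin(2 * math.pi * x)

def s(x):
    """Lipschitz with constant 1 <= M; its integral over [0, 1] is 1.5."""
    return 1.0 + x

M, alpha, d = 2 * math.pi, 1.0, 1
for K in (1, 4, 16, 64):
    gap = sigma_partition(f, s, K) - 1.5
    bound = math.sqrt(2 * d) * M * K ** (-alpha / d)
    print(K, gap, bound)
```

On this example the gap is well below the bound for every K, which is expected since the bound only uses the worst-case Hölder constant.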
Appendix D. Lower bound

Let us write the proof of the lower bound using the terminology of multi-armed bandits. Each arm k represents a stratum and the distribution associated to this arm is defined as the distribution of the noisy samples of the function collected when sampling uniformly on the stratum.

Let us choose \i < 1/2 and a = ^. Consider 2K Bernoulli bandits (i.e., 2K strata 
where the samples follow Bernoulli distributions) where the K first bandits have parameter 
(Hk)i<k<K and the K last ones have parameter 1/2. The [i^ take values in {/j, — a, fi, n + a}. 

Define a 2 = /x(l — fj,) the variance of a Bernoulli of parameter fi, and is such that 

\\i < cr < We wite u_ Q and a +a the two other standard deviations, and notice that 

\\[J l < a -a < y/ji, and J\n < a +a < ^fji. 

We consider the 2 K bandit environments M(v) (characterized by v = (vk)i<k<K £ 
{— 1, +1} K ) defined by (fi^ = \i + VkQt)i<k<K- We write F v the probability with respect to 
the environment M(v) at time n. We also write M{a) the environment defined by all K 
first arms having a parameter a, and write P CT the associated probability at time n. 

The optimal oracle allocation for environment M(v) is to play arm k < K, t^iv) = 
CT " fc " — — — ti times and arm k > K, t^iv) = =^ — — — 77r n times. The corresponding 

quadratic error of the resulting estimate is l{v) = — i=1 (2K) 2 n ' ^ or ^ ne envrr onment 

M(cr), the optimal oracle allocation is to play arm k < K, t(a) = Xa+K/2 n ti mes (and arm 

1 /2 

k > K, t 2 {a) = Ka l K/2 n times). 

Consider deterministic algorithms first (extension to randomized algorithms will be discussed later). An algorithm is a set (for all t = 1 to n − 1) of mappings from any sequence $(r_1, \dots, r_t) \in \{0, 1\}^t$ of t observed samples (where $r_s \in \{0, 1\}$ is the sample observed at the s-th round) to the choice of an arm $I_{t+1} \in \{1, \dots, 2K\}$. Write $T_k(r_1, \dots, r_n)$ the (random variable) corresponding to the number of pulls of arm k up to time n. We thus have $n = \sum_{k=1}^{2K} T_k$.


Now, consider the set of algorithms that know that the K first arms have parameter 
[J>k £ — a, /i, /U + a}, and that also know that the K last arms have their parameters in 
{1/4, 3/4}. Given this knowledge, an optimal algorithm will not pull any arm k < K more 

than {k^Z a+V3K/Aj n times - Indeed, the optimal oracle allocation in all such environments 
allocates less than ^— — < ^jzKji ) n samples to each arm k < K. In addition, since the 
samples of all arms are independent, a sample collected from arm k does not provide any 
information about the relative allocations among the other arms. Thus, once an arm has 
been pulled as many times as recommended by the optimal oracle strategy, there is no 
need to allocate more samples to that arm. Writing A the class of all algorithms that do 
not know the set of possible environments, A v the class of algorithms that know the set of 
possible environments M(v) and A opt the subclass of A„ that pull all arms k < K less than 

I 77 — ct+q at^;, ) n times, we have 



$$\inf_{\mathcal{A}} \sup \mathbb{E}R_n \ge \inf_{\mathcal{A}_\upsilon} \sup \mathbb{E}R_n = \inf_{\mathcal{A}_{opt}} \sup \mathbb{E}R_n,$$
where the first inequality comes from the fact that algorithms in $\mathcal{A}_\upsilon$ possess more information than those in $\mathcal{A}$, which they can use or not. Thus $\mathcal{A} \subset \mathcal{A}_\upsilon$.
Now for any $\upsilon = (\upsilon_1, \dots, \upsilon_K)$, define the events

fit, = {u : \fU C {1, . . . , K} : \U\ < ^ and Vfc G U c , v k T k > v k t(a)}. 
Note that by definition 

n v =[j |J J { P| {v k T k < v k t(a)}} P| { p) {v k T k > v k t(a)}} \. 

P=iuc{i,...,K}-.\u\=p { keu ' ke u c ) 

By the sub-additivity of the probabilities, we have 



K 
3 



\(n v ) < E E 1 

P=1UC{1,...,K}:\U\= P 



{ p| {v k T k < v k t(a)}} P { P {v k T k > v k t(a)}} 



keu 



keU c 



The events I { f] keU {v k T k < v k t(a)}} f| | C\ km c{v k T k > vt(a)}} > are disjoint for dif- 



ferent v, and form a partition of the space, thus J2 V ^ 



{ n fceW H7fc < v k t{a)}) fl { [\ k&A c{vT h > 



Vkt((j)}} 



We deduce that 

K_ 

Ew^EE E 1 

v v p=lUc{l,...,K}:\U\=p 

K_ 

-i E E 

P=lWC{l,...,K}:|W|=p « 

3 

= E E i 

P=lWC{l,...,if}:|W|=p 



{ P {t;T fc < u fc t(a)}} P { P {v k T k > v k t(a)}} 
. keu keu c 

{ P {v k T k < v k t{o)}} P { P {vT k > v k t(a)}} 



keu 



keu c 



K_ 

3 



= E 



K 

P 



Since there are 2 K environments v, we have 



if 

minP CT (^) < ^ E P -(^) ^ ^ E ( 

w p=i ^ 



K 
P 
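Note that $\frac{1}{2^K}\sum_{p=1}^{K/3}\binom{K}{p} \le \mathbb{P}\big(\sum_{k=1}^{K} X_k \le \frac{K}{3}\big)$, where $(X_1, \dots, X_K)$ are K independent Bernoulli random variables of parameter 1/2. By Chernoff-Hoeffding's inequality, we have $\mathbb{P}\big(\sum_{k=1}^{K} X_k \le \frac{K}{3}\big) = \mathbb{P}\big(\frac{1}{K}\sum_{k=1}^{K} X_k - \frac{1}{2} \le -\frac{1}{6}\big) \le \exp(-K/72)$. Thus there exists $\upsilon_{\min}$ such that $\mathbb{P}_\sigma(\Omega_{\upsilon_{\min}}) \le \exp(-K/72)$.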
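The Chernoff-Hoeffding step can be checked against the exact binomial tail. The short sketch below computes $\mathbb{P}(\sum_k X_k \le K/3)$ for fair Bernoulli variables and compares it with $\exp(-K/72)$; the function name `lower_tail` and the values of K are illustrative choices.

```python
import math

def lower_tail(K):
    """Exact P(X_1 + ... + X_K <= K/3) for i.i.d. Bernoulli(1/2) variables."""
    return sum(math.comb(K, p) for p in range(K // 3 + 1)) / 2 ** K

for K in (6, 30, 90, 300):
    print(K, lower_tail(K), math.exp(-K / 72))
```

The exact tail is much smaller than $\exp(-K/72)$ for large K; the proof only needs some exponentially small bound, so a loose constant is harmless.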

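Let us write $p = \mathbb{P}_{\upsilon_{\min}}(\Omega_{\upsilon_{\min}})$ and $p_\sigma = \mathbb{P}_\sigma(\Omega_{\upsilon_{\min}})$. Let $kl(a, b) = a\log\big(\frac{a}{b}\big) + (1 - a)\log\big(\frac{1-a}{1-b}\big)$ denote the KL for Bernoulli distributions with parameters a and b. Note that because $\forall \Omega$, $KL(\mathbb{P}_{\upsilon_{\min}}(\cdot|\Omega), \mathbb{P}_\sigma(\cdot|\Omega)) \ge 0$, we have

$$kl(p, p_\sigma) \le KL(\mathbb{P}_{\upsilon_{\min}}, \mathbb{P}_\sigma).$$

From that we deduce that $p(\log(p) - \log(p_\sigma)) + (1 - p)(\log(1 - p) - \log(1 - p_\sigma)) \le KL(\mathbb{P}_{\upsilon_{\min}}, \mathbb{P}_\sigma)$, which leads to

$$p \le \max\Big(\frac{18}{K}\,KL(\mathbb{P}_{\upsilon_{\min}}, \mathbb{P}_\sigma),\ \exp(-K/72)\Big). \qquad (25)$$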

Let us now consider any environment (v). Let Rt = (r±, . . . ,rt) be the sequence of 
observations, and let P^ be the law of Rt for environment M(v). Note first that P„ = P™. 
Adapting the chain rule for Kullback-Leibler divergence, we get 

n 

= KL(Fl,Fl) + ^2H K'H^-i)KL{f v {.\R t - 1 ),Wi{.\R t )) 

t=2 R t -! 
n 

t=2 R t _!|^=+1 iit_i|v Jt =-l 

= kl(n-a, f i)E v [ £ T k ] + kl(fi + a, fi)E v [ T k \. 

k:v k = -l k:v k =+l 

We thus have, using the property that kl(a, b) < , 

KL^ v ,F ff ) = kl(n-a, f j)E v [ ^ T k ] + kl{fi + a,^)E v [ ^ T k \ 

k:v k = -l k:v k = + l 



k<K 



a 2 



a* 

k<K 
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Note that for an algorithm in A opt , we have Ylk=i Tk < — a ^^ K ^ 4 j n - Since 
a = % and < fi < h we have 



KL(F v ,F a ) < [K-_ )Kn 



Ka- a + s/ZKjV a 



a 2 



a 2 



< 4<j +a — n 

a 2 

< 8— n 

a 



We thus deduce using Equation 25 

18 / \ 

PWja, mi J =P < max(-(KL(F Vmin ,F a )),eM-K/72)) 

144 a 2 
K a 

Now choose o < ^(f ) x / 3 (as a = f = ^). Note that this implies that F Vmin (n Vinia ) < \. 

Let wfff. . We know that for ui, there are at least 4- arms among the K first which 
are not pulled correctly: either -g- arms among the arms with parameter /i — a or among 
the arms with parameter [i + a are not pulled correctly. Assume that for this fixed w, there 
are ^ arms among the arms with parameter /i — a which are not pulled correctly. Let U{lS) 
be this subset of arms. 

We write AT = YlkeU^k ~ Tr^(°"— a) the number of times those arms are over pulled. 
Note that on lj we have AT > ^-t(a) — t(a- a ). We have 

Am K . . K . . 1 Kg 1 Ko^ a 

AT = — tla) ticj-n,) = -^n ^ n 

6 KJ 6 1 a) 6Ka + K/2 ^lx^a + K/2 

1 Ka 1 Ka/V2 

^ ft — ft 

~ 6 Ka + K/2 6 ^3Ka/V2 + K/2 



1 1 1 

6K~a~TK/2VMaJV2 + K/2 



> -— — ' (K 2 o/2 - K 2 a/2\/2^Jn 



> ~(1-1/V2)an 
- 35 
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Thus on cu, the regret is such that 
^ w ^L rt ( w ) (2K)2 n 



> 1 g ^-a + ( ££=1 ^ - **-°/6 + K i 2 f 1 + K i 2 )' 



K 2 6 t fc (cj_ a ) +6AT/K (2K - K/6) 2 (n- AT) (2K) 2 n 

I (> '■' , rr„ ,, + /v/2T ' ' " 7 



i^/2) 2 L ^\ — 



V Ka -<* n A (j2ti^ 1 c t -Ka- a /6+K/2)n, 

1 + 



> 



> C 



{2K) 2 n 

( (E£i^c+^/2)AT \ / (E£i^q+^/2)Ar \ 
1 (E^l^a + ^Z 2 ) 2 ' (Ef.iV-&-o/6+if/2)n/\ (iO-c/e) n / 

(2i^) 2 n / 6Ar(E£i^ a +i^/2) \ / (e£i ^ a +x/ 2 ) at 

(AT) 2 



n 3 cr 



> c KV3 



n 



where C is a numerical constant. Note that for events cu where there are ^ arms among 
the arms with parameter /j, + a which are not pulled correctly, the same result holds. 
Note finally that . ) > 1/2. We thus have that the regret is bigger than 



"mm 
mm 

1 KV3 

" 2 n^' 

which proves the lower bound for deterministic algorithms. Now the extension to random- 
ized algorithms is straightforward: any randomized algorithm can be seen as a static (i.e., 
does not depend on samples) mixture of deterministic algorithms (which can be defined 
before the game starts). Each deterministic algorithm satisfies the lower bound above in 
expectation, thus any static mixture does so too. 
Appendix E. Large deviation inequalities for independent sub-Gaussian random variables

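We first state Bernstein's inequality for large deviations of independent random variables around their mean.

Lemma 14 Let $(X_1, \dots, X_n)$ be n independent random variables of mean $(\mu_1, \dots, \mu_n)$ and of variance $(\sigma_1^2, \dots, \sigma_n^2)$. Assume that there exists $b > 0$ such that for any $\lambda < \frac{1}{b}$, for any $i \le n$, it holds that $\mathbb{E}\big[\exp(\lambda(X_i - \mu_i))\big] \le \exp\Big(\frac{\lambda^2\sigma_i^2}{2(1 - \lambda b)}\Big)$. Then with probability $1 - \delta$

$$\Big|\frac{1}{n}\sum_{i=1}^{n} X_i - \frac{1}{n}\sum_{i=1}^{n} \mu_i\Big| \le \sqrt{\frac{2\big(\frac{1}{n}\sum_{i=1}^{n}\sigma_i^2\big)\log(2/\delta)}{n}} + \frac{b\log(2/\delta)}{n}.$$

Proof If the assumptions of Lemma 14 are verified, then

$$\begin{aligned}
\mathbb{P}\Big(\sum_{i=1}^{n} X_i - \sum_{i=1}^{n}\mu_i \ge nv\Big)
&= \mathbb{P}\Big(\exp\Big(\lambda\Big(\sum_{i=1}^{n} X_i - \sum_{i=1}^{n}\mu_i\Big)\Big) \ge \exp(n\lambda v)\Big) \\
&\le \mathbb{E}\Bigg[\frac{\exp\big(\lambda\big(\sum_{i=1}^{n} X_i - \sum_{i=1}^{n}\mu_i\big)\big)}{\exp(n\lambda v)}\Bigg] \\
&\le \prod_{i=1}^{n}\mathbb{E}\Bigg[\frac{\exp\big(\lambda(X_i - \mu_i)\big)}{\exp(\lambda v)}\Bigg] \\
&\le \exp\Big(\frac{\lambda^2\sum_{i=1}^{n}\sigma_i^2}{2(1 - \lambda b)} - n\lambda v\Big).
\end{aligned}$$

By setting $\lambda = \frac{nv}{\sum_{i=1}^{n}\sigma_i^2 + bnv}$ we obtain

$$\mathbb{P}\Big(\sum_{i=1}^{n} X_i - \sum_{i=1}^{n}\mu_i \ge nv\Big) \le \exp\Big(-\frac{n^2v^2}{2\big(\sum_{i=1}^{n}\sigma_i^2 + bnv\big)}\Big).$$

By a union bound we obtain

$$\mathbb{P}\Big(\Big|\sum_{i=1}^{n} X_i - \sum_{i=1}^{n}\mu_i\Big| \ge nv\Big) \le 2\exp\Big(-\frac{n^2v^2}{2\big(\sum_{i=1}^{n}\sigma_i^2 + bnv\big)}\Big).$$

This means that with probability $1 - \delta$,

$$\Big|\frac{1}{n}\sum_{i=1}^{n} X_i - \frac{1}{n}\sum_{i=1}^{n}\mu_i\Big| \le \sqrt{\frac{2\big(\frac{1}{n}\sum_{i=1}^{n}\sigma_i^2\big)\log(2/\delta)}{n}} + \frac{b\log(2/\delta)}{n}.$$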


We also state the following Lemma on large deviations for the variance of independent random variables.

Lemma 15 Let $(X_1, \dots, X_n)$ be n independent random variables of mean $(\mu_1, \dots, \mu_n)$ and of variance $(\sigma_1^2, \dots, \sigma_n^2)$. Assume that there exists $b > 0$ such that for any $\lambda < \frac{1}{b}$, for any

( .W-' 

< exp 



i < n, it holds that E 



exp(A(Xj - /ij)) 



A<r? 



- ex P ^2(i-A6) ; ■ 

iei V = - Ylii^i ~ n Si Mi) 2 + n ^ ^ e var ^ ance °f a sample chosen uniformly at 
random among the n distributions, and V = ^ Y17=i {p^i ~ h Xj=i Xj) 2 the corresponding 
empirical variance. Then with probability 1 — 6, 



2(1-A6) 



and also E 



exp(A(Xj - m 



V\<2 



(1 + 36 + 4F) log(2/5) 



n 



Proof By decomposing the estimate of the empirical variance in bias and variance, we 
obtain with probability 1 — 5 



n ^-^ n L — ' n ^-^ n L — ' 

i j i i 

=1 E(* - *) a + 2 ^ - Ec* - £ E /*) 

i i i j 

=-E(^-^) 2 + 1 E^- i E^) 2 -(^E^- 1 E^ 

i i j i i 

We then have by the definition of V that with probability 1 — 6 

^-^=-E(*-^) 2 --E^-( 1 E jr '- 1 Ew) 2 - 



(26) 



If the assumptions of Lemma 15 are verified, we have with probability 1 — 6 



n n 

>(E(*-") a -E«^™ 



exp ( A(y~] |Xj - //i| 2 - E " 2 )) - ex P( nAt; ) 



i=l 



1=1 



< E 



exp A(ELil^-^l 2 -ELi^ 



< 



i=l 



exp(nAu) 
exp - - a 2 ; 



exp(Ai;) 



If we take A 



we obtain with probability 1 — 5 



nXv). 



n n 

Y J (x l - w) 2 - E a i ^ ny2 ) ^ ex p(- 



n 2 v 2 



i=l 



i=l 



2(Er=i^ 2 +^)- 



(27) 
By a union bound we get with probability $1 - \delta$ that



n 2 v 2 



t=l i=l 
This means that with probability 1 — 5 



n n V n n 

i=i i=i 



(28) 



Finally, by combining Equations 26 and 28 with Lemma 14, we obtain with probability 
1-5 



|t > r| ^ 4(iEr = i^)log(2/g) , 2b 2 log(2/ ( 5) 2 | / 2(^E? = i^)log(2A) , hlog(2/g) 



< / 2(^Er=i^ 2 )log(2/^) + (3& + 4iEr=i^ 2 )log(2/^) 
~~ V n n 



; 2V\og{2/5) | (36 + 4y)log(2/^) 
~~ V n n 

when n > &log(2/<5) and because V > ^ Ei=i °f • 
This implies with probability 1 — 6 that 



r _ ^ 2yiog(2/J) + log(2/J) (3b + 4V)log(2/5) + log(2/<S) 

n 2n n 2n 



vT _ / log(2/*) < K + (l + 36 + 4tQlog(2/J) 
2n V n 



t . log(2/g) ^ , /(I + 36 + 4F) log(2/<5) 



2n V n 

x t , + 2 1 ' (l + 36 + 4y)log(2/5) 



n 

On the other hand, we have also with probability 1 — 5 



I ' < i - H /2Vdog(2/5) (3fr + 4F)log(2/,5) 



n n 



V n 
Finally, we have with probability 1 — 5 



f-W\<2J ^ + 3b + W ^ 6 \ 
V n 



(29) 